Abstract

A function is said to be additive if, like mutual information, it expands by a factor
of n when evaluated on n i.i.d. repetitions of a source or channel. On the other hand, a
function is said to satisfy the tensorization property if it remains unchanged when evaluated on 
i.i.d. repetitions. Additive rate regions are of fundamental importance in network information 
theory, serving as capacity regions or upper bounds thereof. Tensorizing measures of correlation 
have also found applications in distributed source and channel coding problems as well as the 
distribution simulation problem. Prior to our work only two measures of correlation, namely the 
hypercontractivity ribbon and maximal correlation (and their derivatives), were known to have 
the tensorization property. In this paper, we provide a general framework to obtain a region with 
the tensorization property from any additive rate region. We observe that the hypercontractivity
ribbon indeed comes from the dual of the rate region of the Gray-Wyner source coding problem, 
and generalize it to the multipartite case. Then we define other measures of correlation with 
similar properties from other source coding problems. We also present some applications of our 
results. 


1 Introduction 


Additivity is a fundamental property of interest in information theory (e.g., see [1]), since capacity
regions, by their operational definition, are additive for products of identical channels or sources.
Tensorization is another important property of regions in information theory, which in this paper we
interpret as the dual of additivity. Let us explain the notions of additivity and tensorization
via the example of non-interactive distribution simulation.

Fix some bipartite distribution $p_{XY}$. Suppose that two parties, Alice and Bob, are given i.i.d.
samples $X^n$ and $Y^n$ respectively, and they are asked to output $A$ and $B$ (respectively) distributed
according to some predetermined $q_{AB}$. Alice and Bob can choose $n$ to be as large as they want,
but are not allowed to communicate. Deciding whether this task is feasible is a hard problem
in general. Nevertheless, we may obtain impossibility results using the data
processing inequality.

Suppose that $I(X^n;Y^n) < I(A;B)$. In this case, by the data processing inequality, local
transformation of $(X^n,Y^n)$ to $(A,B)$ is infeasible. However, note that mutual information is additive,
i.e., we have $I(X^n;Y^n) = n \cdot I(X;Y)$. Then, unless $X$ and $Y$ are independent, by choosing $n$ to
be large enough, $I(X^n;Y^n)$ becomes as large as we want and greater than $I(A;B)$. Therefore, the
data processing inequality of mutual information does not give us any useful bound on this problem,
simply because mutual information is additive.
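The additivity of mutual information used above can be checked numerically on a small example. The following is a minimal sketch, assuming a joint pmf stored as a matrix with rows indexed by x and columns by y; the helper names `mutual_information` and `iid_power` and the particular pmf are illustrative choices, not part of the paper.

```python
import numpy as np

def mutual_information(p_xy):
    """Mutual information I(X;Y) in bits for a joint pmf given as a 2-D array."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)   # column vector of p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # row vector of p(y)
    mask = p_xy > 0                          # skip zero-probability cells
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

def iid_power(p_xy, n):
    """Joint pmf of (X^n, Y^n): rows index x^n, columns index y^n."""
    out = np.array([[1.0]])
    for _ in range(n):
        out = np.kron(out, p_xy)             # Kronecker product = independent copy
    return out

# Hypothetical example: a doubly symmetric binary source with crossover 0.2.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
i1 = mutual_information(p_xy)
i3 = mutual_information(iid_power(p_xy, 3))
print(i1, i3, 3 * i1)  # the last two values should agree up to rounding
```

Since the value grows linearly in n, any fixed target I(A;B) is eventually exceeded, which is exactly why the bound becomes vacuous.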

Now suppose that there is some function $\rho(\cdot,\cdot)$ of bipartite distributions that, similar to mutual
information, satisfies the data processing inequality, but is not additive. More precisely, suppose
that

$$\rho(X^n,Y^n) = \rho(X,Y).$$

That is, $\rho(\cdot,\cdot)$ extremely violates additivity and satisfies the above equation, which is called the
tensorization property. Given such a measure and following the previous argument, we find that local
transformation of $(X^n,Y^n)$ to $(A,B)$ is impossible (even for arbitrarily large $n$) if $\rho(X,Y) < \rho(A,B)$.

In the above example we see how tensorization naturally appears as a tool to solve information 
theoretic problems. In the following by giving some examples we clarify the notions of additivity 
and tensorization and then explain our results. 

1.1 Additivity 

Capacity regions by their operational definition are additive for product of identical channels or 
sources since they are expressed as a limit of multi-letter instances of the problem as the blocklength 
goes to infinity. For instance, consider the capacity of a point-to-point channel:

$$C(p(y|x)) = \max_{p(x)} I(X;Y).$$

By its operational definition, the capacity of a product of channels is equal to the sum of
the capacities of the individual channels:

$$C(p(y_1|x_1)p(y_2|x_2)) = C(p(y_1|x_1)) + C(p(y_2|x_2)).$$

This is called the additivity property of the channel capacity. 

Defining additivity for general network information theory problems involving relays and
feedback is more involved [2], but for one-hop networks, when we are dealing with a rate region $\mathcal{R}(\cdot)$,
we say that it is additive if

$$\mathcal{R}(p \times p) = \mathcal{R}(p) + \mathcal{R}(p), \tag{1}$$

where $p$ is the underlying channel or joint distribution and $+$ is the Minkowski sum (point-wise
sum).

Additive regions are of fundamental importance to network information theory, not only because 
of the additivity of capacity regions, but also because the known upper bounds on capacity regions 
are additive. 


1.2 Tensorization 


Tensorization has received relatively less attention compared to additivity. The simplest example
illustrating the definition and applications of tensorization is Witsenhausen's extension [3] of the
Gacs-Korner common information [4]. Assume that Alice and Bob observe i.i.d. repetitions
$X^n$ and $Y^n$ of random variables $X$ and $Y$. Their goal is to extract common randomness via functions $f(\cdot)$ and
$g(\cdot)$ such that with high probability $f(X^n) = g(Y^n)$. Gacs and Korner show that unless $X = (C, X')$
and $Y = (C, Y')$ for some explicit common part $C$, the rate of common randomness extraction is
zero. This result was strengthened by Witsenhausen, who showed that if $X$ and $Y$ do not have any
explicit common part, it is not possible for Alice and Bob to extract even a single common random
bit. This was shown by utilizing a measure of correlation called the maximal correlation [5-10].
Maximal correlation of a given bipartite probability distribution $p_{XY}$ is the maximum of Pearson's
correlation coefficient over all functions of $X$ and $Y$, i.e.,

$$\rho(X,Y) = \max \frac{\mathbb{E}\big[(f_X - \mathbb{E}[f_X])(g_Y - \mathbb{E}[g_Y])\big]}{\sqrt{\mathrm{Var}[f_X]\,\mathrm{Var}[g_Y]}},$$

where $\mathbb{E}[\cdot]$ and $\mathrm{Var}[\cdot]$ are expectation value and variance respectively. Moreover, the maximum is
taken over all non-constant functions $f_X, g_Y$ of $X$ and $Y$ respectively. Maximal correlation can
equivalently be written as

$$\rho(X,Y) = \max \mathbb{E}[f_X g_Y] \quad \text{subject to} \quad \mathbb{E}_X[f] = \mathbb{E}_Y[g] = 0, \quad \mathbb{E}[f^2] = \mathbb{E}[g^2] = 1.$$

We always have $0 \le \rho(X,Y) \le 1$. Moreover, $\rho(X,Y) = 0$ if and only if $X$ and $Y$ are independent,
and $\rho(X,Y) = 1$ if and only if $X$ and $Y$ have an explicit common part as defined above [3]. Maximal
correlation has the following two properties:

• Tensorization: We have

$$\rho(XX', YY') = \max\{\rho(X,Y), \rho(X',Y')\}, \tag{3}$$

when $XY$ and $X'Y'$ are independent, i.e., $p_{XX'YY'} = p_{XY} \cdot p_{X'Y'}$.

• Data Processing: We have

$$\rho(X',Y') \le \rho(X,Y), \tag{4}$$

when $X' \to X \to Y \to Y'$ forms a Markov chain. Thus maximal correlation can be thought of
as a measure of correlation.

Applying the above two properties to the Gacs-Korner problem we find that

$$\rho(f(X^n), g(Y^n)) \le \rho(X^n, Y^n) = \rho(X,Y).$$

As a result, if $\rho(X,Y) < 1$, then $\rho(f(X^n), g(Y^n))$ will also be strictly less than one. Then
Witsenhausen's result is obtained using a certain continuity of maximal correlation and the fact that the
maximal correlation of two perfectly correlated bits is 1.
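For finite alphabets, the tensorization of maximal correlation can be observed numerically. The sketch below relies on a standard characterization that is not stated in this paper: for full-support marginals, the maximal correlation equals the second-largest singular value of the matrix with entries p(x,y)/sqrt(p(x)p(y)). The function name and the two example sources are illustrative assumptions.

```python
import numpy as np

def maximal_correlation(p_xy):
    """Second-largest singular value of Q[x, y] = p(x, y) / sqrt(p(x) p(y)).

    Assumes both marginals have full support and each alphabet has at
    least two symbols (otherwise there is no second singular value).
    """
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    q = p_xy / np.sqrt(np.outer(p_x, p_y))
    singular_values = np.linalg.svd(q, compute_uv=False)  # sorted, largest first
    return float(singular_values[1])

# Two doubly symmetric binary sources with crossover 0.2 and 0.4.
p1 = np.array([[0.4, 0.1], [0.1, 0.4]])
p2 = np.array([[0.3, 0.2], [0.2, 0.3]])

rho1 = maximal_correlation(p1)
rho2 = maximal_correlation(p2)
# Joint pmf of the independent pair: rows index (x, x'), columns index (y, y').
rho12 = maximal_correlation(np.kron(p1, p2))
print(rho1, rho2, rho12)
```

For the product source the value equals the larger of the two individual values, in line with the tensorization property (3).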

More generally, the tensorization and data processing properties of maximal correlation imply 
some bounds on the problem of non-interactive distribution simulation discussed above. That is, 
if we generate random variables $A$ and $B$ from $n$ i.i.d. repetitions of $X$ and $Y$ respectively, i.e., if
$A \to X^n \to Y^n \to B$ for some $n$, then

$$\rho(A,B) \le \rho(X,Y). \tag{5}$$

Tensorization is also helpful in distributed source and channel coding problems [11]. For instance,
consider the problem of transmission of correlated sources over a MAC. Assuming that the
correlated sources observed by the two transmitters are i.i.d. repetitions of $(A,B)$, their inputs to
the MAC at time $i$, which we denote by $X_i$ and $Y_i$, satisfy $X_i \to A^n \to B^n \to Y_i$, and hence
we must have $\rho(X_i,Y_i) \le \rho(A,B)$. Therefore, the set of possible input distributions to the MAC is
restricted. This can be used to prove impossibility results in transmission of correlated sources.

In general, if $\Upsilon(p)$ is a region for a given distribution $p$, we say that it tensorizes or has the
tensorization property if

$$\Upsilon(p_1 \times p_2) = \Upsilon(p_1) \cap \Upsilon(p_2), \tag{6}$$

for any $p_1, p_2$. This in particular implies that for i.i.d. repetitions $p^n$ we have

$$\Upsilon(p^n) = \Upsilon(p). \tag{7}$$

Equation (7) is a weaker version of (6), and is called the weak tensorization property. In this paper we
mostly consider this weak tensorization. So when we say tensorization, we mean (7) unless stated
otherwise. If $\Upsilon(p)$ is a scalar (as for maximal correlation), tensorization translates to

$$\Upsilon(p_1 \times p_2) = \max\{\Upsilon(p_1), \Upsilon(p_2)\}.$$


Tensorizing regions serve as measures of correlation if they satisfy an additional data processing
inequality. Only two examples of tensorizing regions that satisfy the data processing inequality are
known in the literature, and the other such measures are derived from these two. One of them is
the hypercontractivity ribbon [13]. The other one is a generalization of maximal correlation called the
maximal correlation ribbon [15]. Both the hypercontractivity ribbon and the maximal correlation ribbon are
subsets of the real plane and satisfy (6).


1.3 Our contributions 


The key idea is that given a region $\mathcal{R}$ that is an additive function of the joint distribution $p(x_1,\dots,x_k)$,
the cone at which $\mathcal{R}$ is seen from zero is a tensorizable function of the joint distribution. Furthermore,
by subtracting any additive vector from $\mathcal{R}$, the above statement extends to cones at which $\mathcal{R}$ is
seen from one of its corners. This allows for introducing new measures of correlation that (weakly)
tensorize. Our new measures are defined as the dual of the rate regions of certain source coding 
problems. Since by its operational definition, the source coding capacity region is additive, we get 
an operational proof of the tensorization property. Moreover, the source coding problems that we 
consider involve private links to the receivers, making it possible to use the Slepian-Wolf theorem 
to transmit parts of the sources through these links. We show that this implies the data processing 
property in the dual region. The operational proof of data processing does not rely on knowing the 
exact characterization of the original problem (in terms of mutual information). 

With this approach we define new regions that tensorize and satisfy the data processing
inequality. In fact, we show that the hypercontractivity ribbon and maximal correlation are simply two
members of a larger class of regions with the above properties. In particular, making connections
with the Gray-Wyner source coding problem, we naturally extend the definition of the
hypercontractivity ribbon to the multipartite setting. Our construction also generalizes the technique of
initial efficiency to produce tensorizing regions from additive ones (see [22, 23]).


1.4 Structure of the paper 

This paper is organized as follows. In Section 2 we discuss how one can get tensorizing regions
from additive ones. This is followed by a series of examples in Sections 3-6, where new
multipartite and conditional regions are defined. Section 7 addresses the difficulty of computing
regions based on auxiliary random variables, and provides an approach for finding alternative local
regions that are easier to compute. Section 8 discusses additivity and tensorization for a two-way
channel problem, and its application in simulating a two-way channel from another.

1.5 Notation

We mainly adopt the notation of [14]. In particular, we use $[k]$ to denote the set $\{1,2,\dots,k\}$.
We use $x_{[k]}$ to denote the sequence $(x_1, x_2, \dots, x_k)$, and $x^n_{[k]}$ to denote $(x_1^n, x_2^n, \dots, x_k^n)$ where
$x_i^n = (x_{i1}, x_{i2}, \dots, x_{in})$. In general, for a subset $T$ by $x_T$ we mean the tuple of $x_i$'s for $i \in T$. The
complement of subset $T$ is denoted by $T^c$. Random variables are shown in capital letters, whereas
their realizations are shown using lowercase letters.

Expectation value and variance are respectively denoted by $\mathbb{E}[\cdot]$ and $\mathrm{Var}[\cdot]$. When expectation is
computed with respect to some distribution $p(x)$ with associated random variable $X$, we sometimes
denote $\mathbb{E}[\cdot]$ by $\mathbb{E}_X[\cdot]$. We adopt the same notation for variance too.

Letting $p(x,y)$ be some bipartite distribution, the conditional expectation $\mathbb{E}_{X|Y}[\cdot]$ gives a
function of $Y$ which itself is a random variable. We sometimes denote this conditional expectation by
$\mathbb{E}[\cdot|Y]$.

The set of real numbers is denoted by $\mathbb{R}$, and $\mathbb{R}_+ = [0,\infty)$ denotes the set of non-negative real
numbers.

2 From additivity to tensorization 

Consider an arbitrary source coding problem, involving i.i.d. repetitions of random variables $(X_1,\dots,X_k)$,
with some capacity rate¹ region $\mathcal{R}(X_1,\dots,X_k)$ consisting of rate tuples $(R_1,\dots,R_m)$. The
definition of the source coding problem can be quite arbitrary; we only use the fact that from the
operational definition of the rate region we have

$$(R_1,\dots,R_m) \in \mathcal{R}(X_1,\dots,X_k) \iff (nR_1,\dots,nR_m) \in \mathcal{R}(X_1^n,\dots,X_k^n), \tag{8}$$

where $(X_1^n, X_2^n, \dots, X_k^n)$ is $n$ i.i.d. repetitions of $(X_1, X_2, \dots, X_k)$.

Let $\lambda_i$ for $i \in [m]$, and $\theta_S$ for non-empty subsets $S \subseteq [k]$ be arbitrary real numbers. We divide
these variables into two sets, fixing the values of variables in the first set and treating the variables
of the second set as free variables. More specifically, let $T \subseteq [m]$ and $\mathcal{A} \subseteq 2^{[k]} \setminus \{\emptyset\}$ be arbitrary
subsets, and take $\lambda_T$ (shorthand for $\lambda_i$ for $i \in T$) and $\theta_{\mathcal{A}}$ (shorthand for $\theta_S$ for $S \in \mathcal{A}$) as free
variables, and fix the remaining $\lambda_{T^c}$ and $\theta_{\mathcal{A}^c}$ as some real numbers. Then consider the following
real valued function $F_{X_{[k]}} = F_{X_1,\dots,X_k}$ on the free variables and rates

$$F_{X_{[k]}}(\lambda_T, \theta_{\mathcal{A}}, R_{[m]}) = \sum_{i=1}^m \lambda_i R_i + \sum_{S \subseteq [k],\, S \neq \emptyset} \theta_S H(X_S). \tag{9}$$

By taking maximum over all rates in the capacity region we define

$$G_{X_{[k]}}(\lambda_T, \theta_{\mathcal{A}}) = \max_{R_{[m]} \in \mathcal{R}(X_{[k]})} F_{X_{[k]}}(\lambda_T, \theta_{\mathcal{A}}, R_{[m]}). \tag{10}$$

Now, consider the following region in $\mathbb{R}^{|T|+|\mathcal{A}|}$ of the values for the free parameters such that $G_{X_{[k]}}$
is not positive:

$$\Upsilon(X_{[k]}) = \{(\lambda_T, \theta_{\mathcal{A}}) \mid G_{X_{[k]}}(\lambda_T, \theta_{\mathcal{A}}) \le 0\}. \tag{11}$$

The following theorem states that $\Upsilon(X_{[k]})$, which can be understood as the dual of the rate region
$\mathcal{R}(X_{[k]})$, has the tensorization property.

¹ The region $\mathcal{R}$ depends on the joint distribution $p(x_1,\dots,x_k)$ but we adopt the common abuse of notation in
information theory to write it as $\mathcal{R}(X_1,\dots,X_k)$.

Theorem 1. The function $G_{X_{[k]}}(\lambda_T, \theta_{\mathcal{A}})$ is additive and the region $\Upsilon(X_{[k]})$ tensorizes. More
precisely, for any natural number $n$ we have

$$G_{X^n_{[k]}}(\lambda_T, \theta_{\mathcal{A}}) = n \cdot G_{X_{[k]}}(\lambda_T, \theta_{\mathcal{A}}), \tag{12}$$

and

$$\Upsilon(X^n_{[k]}) = \Upsilon(X_{[k]}). \tag{13}$$

Proof. Observe that from equation (9) we have

$$F_{X^n_{[k]}}(\lambda_T, \theta_{\mathcal{A}}, nR_{[m]}) = n F_{X_{[k]}}(\lambda_T, \theta_{\mathcal{A}}, R_{[m]}).$$

Furthermore, by the additivity of the rate region (equation (8)) we have $R_{[m]} \in \mathcal{R}(X_{[k]})$ if and only
if $nR_{[m]} \in \mathcal{R}(X^n_{[k]})$. This implies equation (12). Equation (12) in turn implies (13) by the definition
of $\Upsilon(X_{[k]})$. □
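The two steps of the proof can be written out as a single chain. The display below is only a sketch of one way to read the argument; it assumes, as in the text, that the fixed parameters $\lambda_{T^c}$ and $\theta_{\mathcal{A}^c}$ are the same for $X_{[k]}$ and $X^n_{[k]}$, and it uses $H(X_S^n) = nH(X_S)$ for i.i.d. repetitions. The primed variable $R'_{[m]}$ is introduced here for illustration.

```latex
\begin{aligned}
G_{X^n_{[k]}}(\lambda_T,\theta_{\mathcal{A}})
  &= \max_{R'_{[m]} \in \mathcal{R}(X^n_{[k]})}
     F_{X^n_{[k]}}(\lambda_T,\theta_{\mathcal{A}},R'_{[m]}) \\
  &= \max_{R_{[m]} \in \mathcal{R}(X_{[k]})}
     F_{X^n_{[k]}}(\lambda_T,\theta_{\mathcal{A}},nR_{[m]})
     && \text{by (8), writing } R'_{[m]} = nR_{[m]} \\
  &= \max_{R_{[m]} \in \mathcal{R}(X_{[k]})}
     n\,F_{X_{[k]}}(\lambda_T,\theta_{\mathcal{A}},R_{[m]})
     && \text{by (9) and } H(X_S^n) = nH(X_S) \\
  &= n\,G_{X_{[k]}}(\lambda_T,\theta_{\mathcal{A}}), \\[4pt]
\Upsilon(X^n_{[k]})
  &= \{(\lambda_T,\theta_{\mathcal{A}}) \mid n\,G_{X_{[k]}}(\lambda_T,\theta_{\mathcal{A}}) \le 0\}
   = \Upsilon(X_{[k]}).
\end{aligned}
```

The last equality holds because multiplying by a positive integer $n$ does not change the sign of $G_{X_{[k]}}$.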


In the above theorem we prove the additivity of $G_{X_{[k]}}(\lambda_T, \theta_{\mathcal{A}})$ and the tensorization of $\Upsilon(X_{[k]})$
only in a weak sense, when we consider only i.i.d. repetitions of $X_{[k]}$. To prove tensorization in the
most general case, i.e., to prove (6), we need a stronger version of the additivity of the rate region
$\mathcal{R}(X_{[k]})$ expressed in (1). Indeed, assuming that we start with a source coding problem whose rate
region satisfies (1), the proof of (6) is obtained by a simple modification of the above argument.
However, in this paper we mostly focus on the tensorization property in its weak sense.

Observe that Theorem 1 still holds if we more generally replace the entropy function in equation
(9) with any other additive function (such as an average cost function).

By the above theorem, from any source coding problem we can define a region $\Upsilon(X_{[k]})$ with
the tensorization property. Nevertheless, we would like such a region to also satisfy the data processing
property.


2.1 Data processing 

Data processing is another property that we would like to prove for $\Upsilon(X_{[k]})$. That is, for any

$$p(y_1,\dots,y_k | x_1,\dots,x_k) = \prod_{i=1}^k p(y_i|x_i),$$

we would like to have

$$\Upsilon(X_{[k]}) \subseteq \Upsilon(Y_{[k]}). \tag{14}$$

The data processing property holds if we can show that $G_{X_{[k]}}$ is decreasing under local stochastic
maps, i.e., for any values of $\lambda_T$ and $\theta_{\mathcal{A}}$ we have

$$G_{Y_{[k]}}(\lambda_T, \theta_{\mathcal{A}}) \le G_{X_{[k]}}(\lambda_T, \theta_{\mathcal{A}}). \tag{15}$$

Data processing does not hold for the dual of an arbitrary source coding problem. Indeed,
we should consider an appropriate source coding problem and an appropriate choice of the fixed
parameters $\lambda_{T^c}$ and $\theta_{\mathcal{A}^c}$ for the data processing property to hold. We have an operational proof of
this property when the source coding problem is structured, which we illustrate through concrete
examples in the subsequent sections.

2.2 Connection with initial efficiency

Initial efficiency of a rate $R_1$ with respect to a rate $R_2$ is defined as follows [20, 21]. Let $g(r)$ be
the maximum value of $R_1$ when $R_2$ is less than or equal to $r$. That is,

$$g(r) = \max\{R_1 \mid R_{[m]} \in \mathcal{R}(X_{[k]}),\ R_2 \le r\}. \tag{16}$$

Further assume that $g(0) = 0$, meaning that $R_2 = 0$ implies $R_1 = 0$. Then $g'(0)$, the derivative of
$g(r)$ at $r = 0$, is called the initial efficiency of rate $R_1$ with respect to rate $R_2$. Initial efficiency
quantifies how large $R_1$ becomes when we slightly increase $R_2$ from 0.

It is not hard to see that the initial efficiency tensorizes by its operational definition when we
start with an additive rate region [22, 23]. Then the idea of initial efficiency provides a tool to
obtain functions with the tensorization property. Here we show that this method is a special case of
our construction of tensorizing regions, but before that let us clarify the idea of initial efficiency by
an example.

Example 2. Let us consider the example of common randomness extraction using one-way
communication. Consider two parties who observe i.i.d. repetitions of $X$ and $Y$. There is a one-way
communication of limited rate $r$ from the first party to the second. Then, the maximum rate of
common randomness that can be generated from this source is

$$g(r) = \max_{p(u|x):\ I(X;U) - I(Y;U) \le r} I(X;U).$$

By definition $g(0)$ is equal to the Gacs-Korner common information. Assuming that $g(0) = 0$, the
initial efficiency is equal to

$$g'(0) = \lim_{r \downarrow 0} \frac{g(r)}{r} = \frac{1}{1 - s^*(X,Y)},$$

where

$$s^*(X,Y) = \max_{p(u|x)} \frac{I(Y;U)}{I(X;U)}.$$

As we discuss later, $s^*$ in addition to tensorization satisfies the data processing property as well.

We now show that initial efficiency can be derived from our construction of tensorizing regions.
Suppose that the rate region $\mathcal{R}(X_1,\dots,X_k)$ is convex. Then the convexity of $\mathcal{R}(X_1,\dots,X_k)$
implies that $g(r)$ defined in (16) is concave. As a result, from $g(0) = 0$ we obtain

$$g'(0) = \lim_{r \downarrow 0} \frac{g(r)}{r} = \max_{r \neq 0} \frac{g(r)}{r} = \max_{R_{[m]} \in \mathcal{R}(X_{[k]}),\ R_2 \neq 0} \frac{R_1}{R_2}.$$

Therefore, $g'(0)$ is equal to the minimum value of $\lambda_2$ such that $R_1 - \lambda_2 R_2 \le 0$ for all
$R_{[m]} \in \mathcal{R}(X_{[k]})$. Then defining $F(\lambda_2, R_{[m]}) = R_1 - \lambda_2 R_2$, its associated region $\Upsilon$ is equal to
$[g'(0), \infty)$. We see that initial efficiency is a special case of our construction of tensorizing regions.


Initial efficiency can be defined more generally in terms of other quantities, e.g., as in capacity per unit cost [28].

3 Example 1: Lossless source coding with a helper


In the problem of source coding with a helper, there is a transmitter, a helper and a receiver. The
transmitter has access to i.i.d. repetitions $X^n$ and the helper has access to $Y^n$, where $(X,Y)$ have
a joint distribution $p_{XY}$. The goal of the receiver is to recover $X^n$. See Figure 1.

An $(n, \epsilon, M_1, M_2)$ code for this problem consists of encoder maps $M_1 = \mathcal{E}_1(X^n)$ and $M_2 =
\mathcal{E}_2(Y^n)$, and a decoder map $\hat{X}^n = \mathcal{D}(M_1, M_2)$. The probability of error is equal to $\epsilon = p(\hat{X}^n \neq X^n)$,
and the rate pair of this code is $(R_1, R_2)$ where $R_1 = \frac{1}{n}\log|\mathcal{M}_1|$ and $R_2 = \frac{1}{n}\log|\mathcal{M}_2|$. We let
$\mathcal{R}^h(X,Y)$ be the set of pairs $(R_1, R_2)$ for which there is a sequence of codes $(n, \epsilon_n, M_1, M_2)$ with
asymptotic rate $(R_1, R_2)$ such that $\epsilon_n \to 0$ as $n$ tends to infinity.

Define

$$F_{X,Y}(\lambda, R_1, R_2) = -\lambda R_1 - R_2 + \lambda H(X).$$

Observe that $F_{X,Y}(\lambda, R_1, R_2)$ has the format of (9). Accordingly define

$$G^h_{X,Y}(\lambda) = \max_{(R_1,R_2) \in \mathcal{R}^h(X,Y)} F_{X,Y}(\lambda, R_1, R_2)$$

and

$$\Upsilon^h(X,Y) = \{\lambda \mid G^h_{X,Y}(\lambda) \le 0\}.$$

Observe that $(R_1, 0)$ for sufficiently large $R_1$ is in $\mathcal{R}^h(X,Y)$. Then $\lambda \ge 0$ for any $\lambda \in \Upsilon^h(X,Y)$.

By Theorem 1 the set $\Upsilon^h(X,Y)$ tensorizes, i.e.,

$$\Upsilon^h(X^n, Y^n) = \Upsilon^h(X,Y), \quad \forall n.$$

We now show (via an operational proof) that $\Upsilon^h(X,Y)$ also satisfies the data processing property.
That is, for all stochastic maps $p(y'|y)$ and $p(x'|x)$ we have

$$\Upsilon^h(X,Y) \subseteq \Upsilon^h(X',Y'). \tag{17}$$


To prove this it suffices to show that for any $\lambda$ we have

$$G^h_{X',Y'}(\lambda) \le G^h_{X,Y}(\lambda). \tag{18}$$

By the functional representation lemma [14, Appendix B], any stochastic map can be decomposed
as adding some private randomness and applying some function. That is, there are functions
$f$ and $g$ such that $X' = f(X, A)$ and $Y' = g(Y, B)$ where $A$ and $B$ are independent of each other
and of $(X, Y)$. Then to show (18) we need to prove the following:

I. If $X', Y'$ are functions of $X, Y$ respectively, i.e., if $H(X'|X) = H(Y'|Y) = 0$, then
$G^h_{X',Y'}(\lambda) \le G^h_{X,Y}(\lambda)$.

II. $G^h_{AX,BY}(\lambda) = G^h_{X,Y}(\lambda)$ if $A$ and $B$ are independent of each other and of $(X, Y)$.

Putting the functional representation lemma and the above two cases together, equation (18) is
implied immediately. In the following we prove the above two claims separately.


Proof of I. We need to show that for any $\lambda$

$$\max_{(R_1',R_2') \in \mathcal{R}^h(X',Y')} -\lambda R_1' - R_2' + \lambda H(X') \le \max_{(R_1,R_2) \in \mathcal{R}^h(X,Y)} -\lambda R_1 - R_2 + \lambda H(X).$$

Figure 1: Lossless source coding with a helper

Using the fact that $H(X) = H(XX') = H(X') + H(X|X')$, it suffices to show that if $(R_1', R_2') \in
\mathcal{R}^h(X',Y')$, then $(R_1, R_2) = (R_1' + H(X|X'), R_2') \in \mathcal{R}^h(X,Y)$. To show this, fix a code for the source
$(X',Y')$ with rate pair $(R_1', R_2')$. Now consider the following protocol for the source $(X,Y)$: the
transmitter and helper compute $X', Y'$ from $X, Y$ respectively, and then use the above code to send
$X'$ to the receiver. Then using the Slepian-Wolf theorem, the transmitter by sending $H(X|X')$
extra bits (on average) sends $X$ to the receiver. In this protocol the helper sends information at
rate $R_2 = R_2'$ and the transmitter sends information at rate $R_1 = R_1' + H(X|X')$.

Proof of II. From the definition of the source coding problem it is clear that $G^h_{AX,BY}(\lambda) = G^h_{AX,Y}(\lambda)$
since $B$ has the role of private randomness of the helper. It remains to show that $G^h_{AX,Y}(\lambda) =
G^h_{X,Y}(\lambda)$. Since $X$ is a function of $(A, X)$, using part I we have

$$G^h_{X,Y}(\lambda) \le G^h_{AX,Y}(\lambda).$$

Thus, we need to show that $G^h_{X,Y}(\lambda) \ge G^h_{AX,Y}(\lambda)$, or equivalently

$$\max_{(R_1,R_2) \in \mathcal{R}^h(X,Y)} -\lambda R_1 - R_2 + \lambda H(X) \ge \max_{(R_1,R_2) \in \mathcal{R}^h(AX,Y)} -\lambda R_1 - R_2 + \lambda H(AX).$$

To prove this we show that for any $(R_1, R_2) \in \mathcal{R}^h(AX,Y)$, we have $(R_1 - H(A), R_2) \in \mathcal{R}^h(X,Y)$.
To show this, we again use the Slepian-Wolf theorem.

Fix $(R_1, R_2) \in \mathcal{R}^h(AX,Y)$ and a sequence of codes $(n, \epsilon_n, M_1, M_2)$ achieving this point. Since
$M_2$ is generated from $Y^n$, it is independent of $A^n$. Then using the Fano inequality we have

$$\begin{aligned}
H(M_1|M_2A^n) &= H(M_1|M_2) - I(M_1;A^n|M_2) \\
&= H(M_1|M_2) - I(M_1M_2;A^n) \\
&\le H(M_1) - H(A^n) + o(n) \\
&= n(R_1 - H(A) + o(1)),
\end{aligned}$$

where in the third line we use the fact that $A^n$ can be recovered from $(M_1, M_2)$ with probability at
least $1 - \epsilon_n$. Next, following similar ideas we have

$$\begin{aligned}
H(X^n|M_2A^n) &= H(X^nM_1|M_2A^n) \\
&= H(M_1|M_2A^n) + H(X^n|M_1M_2A^n) \\
&\le H(M_1|M_2A^n) + o(n) \\
&\le n(R_1 - H(A) + o(1)),
\end{aligned} \tag{19}$$

where in the last line we use the previous inequality.

We now construct a protocol that shows $(R_1 - H(A), R_2) \in \mathcal{R}^h(X,Y)$. Think of $A$ as shared
randomness between the transmitter and the receiver. Note that shared randomness does not
change the rate region $\mathcal{R}^h(X,Y)$. In the new protocol the helper uses the same encoding map to
create $M_2$ from $Y^n$. Then the receiver has $A^n$ in hand and gets $M_2$ from the helper. Then by
the Slepian-Wolf theorem, if we consider $N$ i.i.d. repetitions of this code, the transmitter needs to
send only $H(X^n|M_2A^n) + o(n)$ bits on average to convey $X^n$ to the receiver. In this protocol the
rate of communication from the helper is $R_2$ and the rate of communication from the transmitter is
$\frac{1}{n}H(X^n|M_2A^n) + o(1)$, which using (19) is at most $R_1 - H(A) + o(1)$. Then $(R_1 - H(A), R_2) \in
\mathcal{R}^h(X,Y)$.

By the above discussion $\Upsilon^h(X,Y)$ satisfies the tensorization and data processing properties.
Note that for proving these properties, we did not use the characterization of the capacity region
$\mathcal{R}^h(X,Y)$; we proved these properties via operational arguments and used only the Slepian-Wolf
theorem. Nevertheless, we may use the characterization of $\mathcal{R}^h(X,Y)$ to compute $\Upsilon^h(X,Y)$.

From [14, Theorem 10.2] the capacity region $\mathcal{R}^h(X,Y)$ is equal to the set of pairs $(R_1, R_2)$
satisfying

$$R_1 \ge H(X|U), \qquad R_2 \ge I(Y;U),$$

for some conditional distribution $p(u|y)$. Then for non-negative values of $\lambda$ we have

$$G^h_{X,Y}(\lambda) = \max_{U - Y - X} \lambda I(X;U) - I(Y;U). \tag{20}$$

Therefore, $\lambda \in \Upsilon^h(X,Y)$ if and only if $\lambda I(X;U) - I(Y;U) \le 0$ for all $p(u|y)$. Equivalently,
$\lambda \in \Upsilon^h(X,Y)$ iff

$$\frac{1}{\lambda} \ge \max_{U - Y - X} \frac{I(X;U)}{I(Y;U)} = s^*(Y,X).$$

Therefore, our discussion above provides a proof for the fact that $s^*(Y,X)$ tensorizes and satisfies
the data processing inequality.
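The ratio defining $s^*(Y,X)$ can be explored numerically by sampling test channels $p(u|y)$. The sketch below is an illustration only: it uses a random search, which gives a lower bound on the maximum rather than its exact value, and it compares against $(1-2a)^2$, the value known from the literature (not derived in this paper) for a doubly symmetric binary source with crossover probability $a$. All function names and parameters are hypothetical.

```python
import numpy as np

def mutual_information(p_ab):
    """I(A;B) in bits for a joint pmf given as a 2-D array."""
    p_ab = np.asarray(p_ab, dtype=float)
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log2(p_ab[mask] / (p_a @ p_b)[mask])))

def s_star_lower_bound(p_xy, n_u=2, trials=500, seed=0):
    """Random-search lower bound on the max over p(u|y) of I(X;U) / I(Y;U)."""
    rng = np.random.default_rng(seed)
    p_xy = np.asarray(p_xy, dtype=float)     # rows: x, columns: y
    p_y = p_xy.sum(axis=0)
    best = 0.0
    for _ in range(trials):
        # Random test channel p(u|y): rows indexed by y, columns by u.
        p_u_given_y = rng.dirichlet(np.ones(n_u), size=p_y.size)
        i_yu = mutual_information(p_y[:, None] * p_u_given_y)
        if i_yu < 1e-6:                      # skip nearly useless U (noisy ratio)
            continue
        # Joint pmf of (X, U) under the Markov chain U - Y - X.
        i_xu = mutual_information(p_xy @ p_u_given_y)
        best = max(best, i_xu / i_yu)
    return best

a = 0.1
dsbs = np.array([[(1 - a) / 2, a / 2],
                 [a / 2, (1 - a) / 2]])
single = s_star_lower_bound(dsbs)                  # one copy of (X, Y)
double = s_star_lower_bound(np.kron(dsbs, dsbs))   # two i.i.d. copies
print(single, double, (1 - 2 * a) ** 2)
```

Both sampled values stay below the single-letter value, which is what tensorization predicts for the two-copy source.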

By the above discussion $s^*(Y,X)$ is the initial efficiency of the one-helper source coding problem:
let $h(R_2)$ be the minimum value of $R_1$ for a given $R_2$. Then $h(0) = H(X)$. Let $g(R_2) = h(0) - h(R_2)$.
Then

$$s^*(Y,X) = \max_{R_{[2]} \in \mathcal{R}^h(X,Y),\ R_2 \neq 0} \frac{g(R_2)}{R_2}.$$


4 Example 2: One side-information source problem 

The one side-information source problem [24, Problem 16.6 (c)] is a generalization of the problem
considered in Section 3. Here there are $k$ transmitters, one helper and $k$ receivers. Transmitter
$i$, $1 \le i \le k$, observes i.i.d. repetitions $X_i^n$ and the helper observes i.i.d. repetitions $X_{k+1}^n$. The
$i$-th transmitter sends information at rate $R_i$ to receiver $i$, and the helper broadcasts information to all
receivers at rate $R_{k+1}$. The goal of the $i$-th receiver is to recover $X_i^n$. See Figure 2. We denote the
set of achievable rate tuples $(R_1,\dots,R_{k+1})$ for this problem by $\mathcal{R}^s(X_1,\dots,X_{k+1})$.

Figure 2: One side-information source problem


To obtain a dual for this rate region let us define

$$F_{X_{[k+1]}}(\lambda_{[k]}, R_{[k+1]}) = -R_{k+1} - \sum_{i=1}^k \lambda_i R_i + \sum_{i=1}^k \lambda_i H(X_i).$$

Then let

$$G^s_{X_{[k+1]}}(\lambda_{[k]}) = \max_{R_{[k+1]} \in \mathcal{R}^s(X_{[k+1]})} F_{X_{[k+1]}}(\lambda_{[k]}, R_{[k+1]}),$$

and

$$\Upsilon^s(X_{[k+1]}) = \{\lambda_{[k]} \mid G^s_{X_{[k+1]}}(\lambda_{[k]}) \le 0\}.$$

Again for sufficiently large $R_1,\dots,R_k$ we have $(R_1,\dots,R_k,0) \in \mathcal{R}^s(X_{[k+1]})$. Then for any
$\lambda_{[k]} \in \Upsilon^s(X_{[k+1]})$ we have $\lambda_i \ge 0$.

By Theorem 1 the function $G^s_{X_{[k+1]}}(\lambda_{[k]})$ is additive and the set $\Upsilon^s(X_{[k+1]})$ satisfies tensorization.
We claim that $\Upsilon^s(X_{[k+1]})$ also satisfies the data processing property. To prove this claim it suffices
to show that for any $p(x_i'|x_i)$ we have

$$G^s_{X'_{[k+1]}}(\lambda_{[k]}) \le G^s_{X_{[k+1]}}(\lambda_{[k]}).$$


The proof of this inequality is completely similar to the proof of (18) given in the previous section,
and we do not repeat it in full detail here. Briefly speaking, as before we first use the functional
representation lemma to break the proof into two parts. We first consider the case where $X_i'$ is a
function of $X_i$; here we argue that it suffices to show that if $R'_{[k+1]} \in \mathcal{R}^s(X'_{[k+1]})$, then

$$(R_1' + H(X_1|X_1'), \dots, R_k' + H(X_k|X_k'), R_{k+1}') \in \mathcal{R}^s(X_{[k+1]}).$$

This follows again from the Slepian-Wolf theorem. Next, we show that $G^s_{A_{[k+1]}X_{[k+1]}}(\lambda_{[k]}) =
G^s_{X_{[k+1]}}(\lambda_{[k]})$ when $A_1,\dots,A_{k+1}$ are independent of each other and of $X_{[k+1]}$. For this we show
that if $R_{[k+1]} \in \mathcal{R}^s(A_1X_1,\dots,A_kX_k,X_{k+1})$, then

$$(R_1 - H(A_1), \dots, R_k - H(A_k), R_{k+1}) \in \mathcal{R}^s(X_{[k+1]}).$$

This follows again from thinking of $A_{[k]}$ as shared randomness among the parties and using the
Fano inequality and Slepian-Wolf theorem.

Now we have a region $\Upsilon^s(X_{[k+1]})$ that tensorizes and satisfies data processing. Using
[24, Problem 16.6 (c)], the capacity region $\mathcal{R}^s(X_{[k+1]})$ of this problem is given by

$$R_{k+1} \ge I(U;X_{k+1}), \tag{21}$$

$$R_i \ge H(X_i|U), \quad \forall i \in [k], \tag{22}$$

for some $U - X_{k+1} - X_{[k]}$. Therefore, for non-negative tuples $\lambda_{[k]}$, we have

$$G^s_{X_{[k+1]}}(\lambda_{[k]}) = \max_{U - X_{k+1} - X_{[k]}} -I(X_{k+1};U) + \sum_{i=1}^k \lambda_i I(X_i;U). \tag{23}$$

As a result, $\lambda_{[k]} \in \Upsilon^s(X_{[k+1]})$ iff

$$\sum_{i=1}^k \lambda_i I(X_i;U) \le I(X_{k+1};U),$$

for every $U - X_{k+1} - X_{[k]}$. The following theorem summarizes the above findings.

Theorem 3. For any distribution $p_{X_{[k+1]}}$ let $\Upsilon^s(X_{[k+1]})$ be the set of all non-negative $\lambda_{[k]}$ such that

$$\sum_{i=1}^k \lambda_i I(X_i;U) \le I(X_{k+1};U),$$

for all $p(u|x_{k+1})$. Then $\Upsilon^s(X_{[k+1]})$ satisfies the data processing inequality and tensorization.

The region $\Upsilon^s(X_{[k+1]})$ is non-empty; by the data processing inequality, if $U - X_{k+1} - X_{[k]}$ forms a
Markov chain, we have $I(X_i;U) \le I(X_{k+1};U)$. Then $\Upsilon^s(X_{[k+1]})$ includes any $\lambda_{[k]}$ satisfying $0 \le \lambda_i$
and $\sum_{i=1}^k \lambda_i \le 1$.

Example 4. Consider the special case where $k = 2$ and $X_3 = (X_1, X_2)$. In this case $\Upsilon^s(X_{[3]})$ is
equivalent to the following region:

$$\mathfrak{R}(X_1,X_2) = \{(\lambda_1,\lambda_2) \in \mathbb{R}_+^2 \mid \lambda_1 I(X_1;U) + \lambda_2 I(X_2;U) \le I(X_1X_2;U)\},$$

where the inequality is required to hold for all $p(u|x_1,x_2)$.
Then $\mathfrak{R}(X_1,X_2)$ satisfies the tensorization and data processing properties.

Observe that in the special case of $k = 2$ and $X_3 = (X_1, X_2)$, the rate region given in equations
(21) and (22) reduces to the Gray-Wyner rate region [19]. Then $\mathfrak{R}(X_1,X_2)$ can be understood
as the dual of the Gray-Wyner region.

The following theorem of Nair [15] gives another characterization of $\mathfrak{R}(X_1,X_2)$ defined above.

Theorem 5 ([15]). $(\lambda_1,\lambda_2) \in \mathfrak{R}(X_1,X_2)$ if and only if for every pair of functions $f_{X_1}: \mathcal{X}_1 \to \mathbb{R}$
and $g_{X_2}: \mathcal{X}_2 \to \mathbb{R}$ we have

$$\mathbb{E}[f_{X_1} g_{X_2}] \le \|f_{X_1}\|_{\frac{1}{\lambda_1}} \|g_{X_2}\|_{\frac{1}{\lambda_2}}, \tag{24}$$

where the Schatten norms are defined by $\|f_{X_1}\|_{\frac{1}{\lambda_1}} = \mathbb{E}\big[|f_{X_1}|^{1/\lambda_1}\big]^{\lambda_1}$ and similarly for $\|g_{X_2}\|_{\frac{1}{\lambda_2}}$.

The set of pairs $(\lambda_1,\lambda_2)$ satisfying (24) is the hypercontractivity ribbon defined in [13]. The
hypercontractivity ribbon is known to satisfy data processing and tensorization. The above theorem
gives an alternative characterization of the hypercontractivity ribbon.

Another interesting property of the hypercontractivity ribbon is that it characterizes $s^*(X,Y)$ as
follows:

$$s^*(X,Y) = \inf_{(\lambda_1,\lambda_2) \in \mathfrak{R}(X,Y)} \frac{1 - \lambda_1}{\lambda_2}. \tag{25}$$

For a proof of this equation see [17].

Example 6 (Multipartite hypercontractivity ribbon). In Theorem^ assume that (k is arbitrary 
and) $X_{k+1} = (X_1,\dots,X_k)$. Then $\Upsilon^s(X_{[k+1]})$ reduces to

$$\mathfrak{R}(X_{[k]}) = \Big\{\lambda_{[k]}\in\mathbb{R}_+^k \,\Big|\, \sum_{i=1}^k \lambda_i I(X_i;U) \le I(X_{[k]};U)\Big\}.$$

As a result, $\mathfrak{R}(X_{[k]})$ satisfies data processing and tensorization.

Letting $U = X_i$ we observe that if $\lambda_{[k]}\in\mathfrak{R}(X_{[k]})$ then $\lambda_i \le 1$. Therefore,

$$\mathfrak{R}(X_{[k]}) \subseteq [0,1]^k.$$

Furthermore, since $\mathfrak{R}(X_{[k]})$ is a special case of regions of the form $\Upsilon^s$, it includes any $\lambda_{[k]}$ satisfying $0 \le \lambda_i$ and $\sum_{i=1}^k \lambda_i \le 1$, as argued above.

The multipartite hypercontractivity ribbon is equal to $[0,1]^k$ if and only if the $X_i$ are mutually independent. To prove this note that if $(1,1,\dots,1)\in\mathfrak{R}(X_{[k]})$ then by setting $U = X_{[k]}$ we find that $\sum_{i=1}^k H(X_i) \le H(X_{[k]})$. Then by the subadditivity inequality of entropy, the $X_i$ are mutually independent. On the other hand, for mutually independent variables $X_i$ we have $\sum_{i=1}^k H(X_i) = H(X_{[k]})$ and $\sum_{i=1}^k H(X_i|U) \ge H(X_{[k]}|U)$. This shows that $(1,1,\dots,1)\in\mathfrak{R}(X_{[k]})$. It is straightforward to generalize Theorem 5 of [15] to show that the multipartite region $\mathfrak{R}(X_{[k]})$ has a characterization in terms of Schatten norms.
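The argument with $U = X_{[k]}$ can be checked numerically. The sketch below evaluates $\sum_i \lambda_i I(X_i;U) - I(X_{[k]};U)$ for this single test choice of $U$; a positive value certifies that $\lambda_{[k]}$ lies outside the ribbon, while a non-positive value is only a necessary condition for membership. The function names (`mutual_information`, `ribbon_gap`) and the example distributions are assumptions made for illustration.

```python
import math

def mutual_information(pab):
    """I(A;B) in bits from a dict {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in pab.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in pab.items() if p > 0)

def ribbon_gap(px, lams):
    """sum_i lam_i I(X_i;U) - I(X_[k];U) for the test choice U = X_[k].

    px maps tuples (x_1, ..., x_k) to probabilities. With U = X_[k],
    I(X_i;U) = H(X_i) and I(X_[k];U) = H(X_[k]).
    """
    k = len(next(iter(px)))
    joint_u = {(x, x): p for x, p in px.items()}      # pairs (X_[k], U)
    gap = -mutual_information(joint_u)
    for i in range(k):
        marg = {}
        for x, p in px.items():                        # pairs (X_i, U)
            marg[(x[i], x)] = marg.get((x[i], x), 0.0) + p
        gap += lams[i] * mutual_information(marg)
    return gap

# Two independent uniform bits: the gap at (1, 1) is zero.
independent = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
gap_independent = ribbon_gap(independent, (1.0, 1.0))

# Two identical bits: the gap at (1, 1) is positive, so (1, 1) is excluded.
identical = {(0, 0): 0.5, (1, 1): 0.5}
gap_identical = ribbon_gap(identical, (1.0, 1.0))

print(gap_independent, gap_identical)
```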


5 Example 3: Fork network with side information

The fork network with side information is another generalization of the problem we studied in Section 3 (see [24, Problem 16.31], [14, Theorem 10.4]). The difference between this problem and the one considered in Section 4 is that there is only one decoder, who needs to recover $X_{[k]}$. The problem is depicted in Figure 3. We denote the capacity region of this problem by $\mathcal{R}^f(X_1,\dots,X_{k+1})$.

As in Section 4 define

$$F_{X_{[k+1]}}(\lambda_{[k]}, R_{[k+1]}) = -R_{k+1} - \sum_{j=1}^k \lambda_j R_j + \sum_{j=1}^k \lambda_j H(X_j),$$

$$G^f_{X_{[k+1]}}(\lambda_{[k]}) = \max_{R_{[k+1]}\in\mathcal{R}^f(X_{[k+1]})} F_{X_{[k+1]}}(\lambda_{[k]}, R_{[k+1]}),$$

and

$$\Upsilon^f(X_{[k+1]}) = \{\lambda_{[k]} \mid G^f_{X_{[k+1]}}(\lambda_{[k]}) \le 0\}.$$

As in the previous two sections, $\Upsilon^f(X_{[k+1]})$ may only contain non-negative tuples $\lambda_{[k]}$. Again Theorem 1 implies that $G^f_{X_{[k+1]}}(\lambda_{[k]})$ is additive and the set $\Upsilon^f(X_{[k+1]})$ tensorizes.

We claim that $\Upsilon^f(X_{[k+1]})$ also satisfies the data processing property. To show this, we prove that for any $p(x'_i|x_i)$ we have

$$G^f_{X'_{[k+1]}}(\lambda_{[k]}) \le G^f_{X_{[k+1]}}(\lambda_{[k]}).$$

Again we split the proof in two parts. When $X'_i$ is a function of $X_i$, the proof is identical to the one given earlier. It remains to show that $G^f_{A_{[k+1]}X_{[k+1]}}(\lambda_{[k]}) = G^f_{X_{[k+1]}}(\lambda_{[k]})$ when $A_1,\dots,A_{k+1}$ are independent of each other and of $X_{[k+1]}$. For this we need to show that if

$$(R_1,\dots,R_{k+1}) \in \mathcal{R}^f(A_1X_1,\dots,A_kX_k,X_{k+1}),$$

then

$$(R_1 - H(A_1),\dots,R_k - H(A_k),R_{k+1}) \in \mathcal{R}^f(X_1,\dots,X_{k+1}).$$

To prove this last claim, we follow similar ideas as before. We start with a sequence of $(n, \epsilon_n, M_1,\dots,M_{k+1})$ codes with asymptotic rate tuple $(R_1,\dots,R_{k+1}) \in \mathcal{R}^f(A_1X_1,\dots,A_kX_k,X_{k+1})$. Take a non-empty subset $S \subseteq [k]$. Letting $S^c = [k] - S$, we have

$$\begin{aligned}
H(X_S^n \mid M_{k+1}A_{[k]}^n X_{S^c}^n) &= H(X_S^n M_S \mid M_{k+1}A_{[k]}^n X_{S^c}^n M_{S^c}) \\
&= H(M_S \mid M_{k+1}A_{[k]}^n X_{S^c}^n M_{S^c}) + H(X_S^n \mid A_{[k]}^n M_{[k+1]} X_{S^c}^n) \\
&\le H(M_S \mid M_{k+1}A_{[k]}^n X_{S^c}^n M_{S^c}) + o(n) && (26)\\
&= H(M_S \mid M_{k+1}A_{S^c}^n X_{S^c}^n M_{S^c}) - I(A_S^n; M_S \mid M_{k+1}A_{S^c}^n X_{S^c}^n M_{S^c}) + o(n) \\
&\le H(M_S) - H(A_S^n \mid M_{k+1}A_{S^c}^n X_{S^c}^n M_{S^c}) + H(A_S^n \mid A_{S^c}^n X_{S^c}^n M_{[k+1]}) + o(n) \\
&= H(M_S) - H(A_S^n) + H(A_S^n \mid A_{S^c}^n X_{S^c}^n M_{[k+1]}) + o(n) && (27)\\
&= H(M_S) - H(A_S^n) + o(n) && (28)\\
&\le \sum_{i\in S}\big(H(M_i) - nH(A_i)\big) + o(n) && (29)\\
&= n\Big(\sum_{i\in S}(R_i - H(A_i)) + o(1)\Big). && (30)
\end{aligned}$$

Here equations (26) and (28) follow from Fano's inequality; equation (27) follows from the fact that $A_S^n$ is independent of $A_{S^c}^n X_{[k+1]}^n$ and hence of $M_{k+1}A_{S^c}^n X_{S^c}^n M_{S^c}$; finally equation (29) uses the fact that the $A_i$'s are mutually independent.

Now we construct a code for inputs $X_{[k+1]}^n$. We think of $A_{[k]}^n$ as shared randomness given to all the parties. We assume that the encoder $k+1$ creates side information $M_{k+1}$ and sends it to the receiver as before. Then the receiver has side information $M_{k+1}A_{[k]}^n$ and wants to decode $X_{[k]}^n$. To this end, we use the Slepian-Wolf theorem, which states that the recovery of $X_{[k]}^n$ is possible if the $i$-th transmitter, for $1 \le i \le k$, sends information at rate $R'_i$, assuming that for every subset $S \subseteq [k]$ we have

$$\sum_{i\in S} R'_i \ge H(X_S^n \mid X_{S^c}^n M_{k+1} A_{[k]}^n),$$

where $S^c = [k] - S$. However, from (30) we have

$$H(X_S^n \mid M_{k+1}A_{[k]}^n X_{S^c}^n) \le n\Big(\sum_{i\in S}(R_i - H(A_i)) + o(1)\Big).$$

Figure 3: Fork network with side information

Therefore, if we set $R'_i = n(R_i - H(A_i) + o(1))$, the conditions of the Slepian-Wolf theorem with side information at the decoder are satisfied. Thus, we can transmit $N$ repetitions of $X_{[k]}^n$ at the average rate of $n(R_i - H(A_i) + o(1))$. This shows that

$$(R_1 - H(A_1),\dots,R_k - H(A_k),R_{k+1}) \in \mathcal{R}^f(X_1,\dots,X_{k+1}).$$


The above discussion implies that $\Upsilon^f(X_{[k+1]})$ satisfies data processing and tensorization. According to [14, Theorem 10.4], the capacity region $\mathcal{R}^f(X_{[k+1]})$ consists of tuples $R_{[k+1]}$ such that

$$R_{k+1} \ge I(U;X_{k+1}), \qquad (31)$$

$$\sum_{i\in S} R_i \ge H(X_S \mid UX_{S^c}), \quad \forall S \subseteq [k], \qquad (32)$$

for some $U - X_{k+1} - X_{[k]}$.

Let us consider the special case $k = 2$. Then, the rate region is described by

$$\begin{aligned}
R_3 &\ge I(U;X_3), \\
R_1 &\ge H(X_1|UX_2), \\
R_2 &\ge H(X_2|UX_1), \\
R_1 + R_2 &\ge H(X_1X_2|U).
\end{aligned}$$


The corner points of this region are

$$(R_1,R_2,R_3) = \big(H(X_1|X_2U),\ H(X_2|U),\ I(U;X_3)\big),$$

and

$$(R_1,R_2,R_3) = \big(H(X_1|U),\ H(X_2|X_1U),\ I(U;X_3)\big).$$


Since $G^f_{X_{[k+1]}}(\lambda_{[k]})$ involves maximization of a linear function, its maximum occurs at one of these corner points. Then one can verify that for non-negative values of $\lambda_1$ and $\lambda_2$ we have

$$G^f_{X_{[3]}}(\lambda_1,\lambda_2) = \max_{U - X_3 - X_1X_2} -I(X_3;U) + \lambda_1 I(X_1;U) + \lambda_2 I(X_2;U) + \max\{\lambda_1,\lambda_2\}\, I(X_1;X_2|U).$$

Hence, we have the following theorem.

Theorem 7. The following region satisfies data processing and tensorization:

$$\Upsilon^f(X_1,X_2,X_3) = \big\{(\lambda_1,\lambda_2)\in\mathbb{R}_+^2 \,\big|\, \lambda_1 I(X_1;U) + \lambda_2 I(X_2;U) + \max\{\lambda_1,\lambda_2\}\, I(X_1;X_2|U) \le I(X_3;U),\ \forall p(u|x_3)\big\}.$$

Note that the above region differs from the hypercontractivity ribbon as it includes the term $\max\{\lambda_1,\lambda_2\}\, I(X_1;X_2|U)$.
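The origin of the term $\max\{\lambda_1,\lambda_2\}\,I(X_1;X_2|U)$ can be traced to the two corner points of the rate region. The following worked evaluation is a sketch that assumes the definition of $F$ from this section with $k = 2$; the shorthand $F^{(1)}$, $F^{(2)}$ for its value at the first and second corner point is introduced here only for illustration.

```latex
\begin{align*}
F^{(1)} &= -I(U;X_3) - \lambda_1 H(X_1|X_2U) - \lambda_2 H(X_2|U)
           + \lambda_1 H(X_1) + \lambda_2 H(X_2) \\
        &= -I(U;X_3) + \lambda_1\bigl[I(X_1;U) + I(X_1;X_2|U)\bigr]
           + \lambda_2 I(X_2;U), \\
F^{(2)} &= -I(U;X_3) - \lambda_1 H(X_1|U) - \lambda_2 H(X_2|X_1U)
           + \lambda_1 H(X_1) + \lambda_2 H(X_2) \\
        &= -I(U;X_3) + \lambda_1 I(X_1;U)
           + \lambda_2\bigl[I(X_2;U) + I(X_1;X_2|U)\bigr].
\end{align*}
```

Both lines use $H(X_1) - H(X_1|X_2U) = I(X_1;U) + I(X_1;X_2|U)$ and its symmetric counterpart. Taking the larger of $F^{(1)}$ and $F^{(2)}$ yields the coefficient $\max\{\lambda_1,\lambda_2\}$ in front of $I(X_1;X_2|U)$.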

By setting $U$ to be a constant random variable, we observe that $\Upsilon^f(X_1,X_2,X_3) = \{(0,0)\}$ if $I(X_1;X_2) > 0$. Therefore, to get a non-trivial region one must have $I(X_1;X_2) = 0$. Assuming this and using the expansion $I(X_1;X_2|U) - I(X_1;X_2) = -I(X_1;U) - I(X_2;U) + I(X_1X_2;U)$, we observe that $\Upsilon^f(X_1,X_2,X_3)$ has the following alternative characterization (when $I(X_1;X_2) = 0$):

$$\Upsilon^f(X_1,X_2,X_3) = \big\{(\lambda_1,\lambda_2)\in\mathbb{R}_+^2 \,\big|\, \min\{0,\lambda_1-\lambda_2\}\, I(X_1;U) + \min\{0,\lambda_2-\lambda_1\}\, I(X_2;U) + \max\{\lambda_1,\lambda_2\}\, I(X_1X_2;U) \le I(X_3;U),\ \forall p(u|x_3)\big\}. \qquad (33)$$

The above expression allows for an explicit characterization of the set of pairs $(\lambda,\lambda)\in\Upsilon^f(X_1,X_2,X_3)$. Indeed, $(\lambda,\lambda)$ is in $\Upsilon^f(X_1,X_2,X_3)$ if and only if

$$\frac{1}{\lambda} \ge \max_{U - X_3 - X_1X_2} \frac{I(X_1X_2;U)}{I(X_3;U)} = s^*(X_3;X_1X_2).$$


6 Conditional tensorization 


Consider the source coding problem of Section 4. Let us provide all of the parties (encoders and decoders) with i.i.d. repetitions of some random variable $Z$, which is jointly distributed with $X_{[k+1]}$. This is similar to the idea of coded time sharing [14, Sec. 4.5.3]. Then one can see that the capacity region $\mathcal{R}^c(X_1,\dots,X_{k+1},Z)$ for this problem is equal to the one given in equations (21) and (22), except that everything gets conditioned on $Z$:

$$R_{k+1} \ge I(U;X_{k+1}|Z), \qquad (34)$$

$$R_i \ge H(X_i|UZ), \quad \forall i\in[k], \qquad (35)$$

for some $U - X_{k+1}Z - X_{[k]}$. This region gives rise to the region $\Upsilon^c(X_1,\dots,X_{k+1},Z)$ consisting of all non-negative $\lambda_i$ such that

$$\sum_{i=1}^k \lambda_i I(X_i;U|Z) \le I(X_{k+1};U|Z),$$

for all $p(u|z x_{k+1})$.

Let us consider the special case of $k = 2$, $X_3 = (X_1, X_2)$:

Theorem 8 (Conditional bipartite hypercontractivity ribbon). Let

$$\mathfrak{R}(X_1,X_2|Z) = \{(\lambda_1,\lambda_2)\in\mathbb{R}_+^2 \mid \lambda_1 I(X_1;U|Z) + \lambda_2 I(X_2;U|Z) \le I(X_1X_2;U|Z),\ \forall U\}.$$

Then we have

• Tensorization: $\mathfrak{R}(X_1,X_2|Z) = \mathfrak{R}(X_1^n,X_2^n|Z^n)$ if $(X_1^n,X_2^n,Z^n)$ is $n$ i.i.d. repetitions of $(X_1,X_2,Z)$.

• Data processing: $\mathfrak{R}(X_1,X_2|Z) \subseteq \mathfrak{R}(X'_1,X'_2|Z)$ for any $p(x'_1|x_1z)$ and $p(x'_2|x_2z)$.

The above properties of the conditional hypercontractivity ribbon can be proved operationally as before. Alternatively, we have the following characterization of the conditional hypercontractivity ribbon, from which the above theorem follows.

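Lemma 9. We have

$$\mathfrak{R}(X_1,X_2|Z) = \bigcap_{z:\,p(z)>0} \mathfrak{R}(X_1,X_2|Z=z). \qquad (36)$$

Proof. It suffices to show that

$$\lambda_1 I(X_1;U|Z) + \lambda_2 I(X_2;U|Z) \le I(X_1X_2;U|Z), \quad \forall U, \qquad (37)$$

if and only if

$$\lambda_1 I(X_1;U|Z=z) + \lambda_2 I(X_2;U|Z=z) \le I(X_1X_2;U|Z=z), \quad \forall U, \qquad (38)$$

for all $z$ with $p(z) > 0$. Clearly, equation (38) implies (37). To see the converse, given any arbitrary $z^*$, observe that we can choose $U$ to be a constant if $z \ne z^*$. For such a $U$, every term of (37) equals $p(z^*)$ times the corresponding term of (38) at $z = z^*$. □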

One can similarly define conditional $s^*(X_1,X_2|Z)$ either using (25) as

$$s^*(X_1,X_2|Z) = \inf_{(\lambda_1,\lambda_2)\in\mathfrak{R}(X_1,X_2|Z)} \frac{1-\lambda_1}{\lambda_2},$$

or directly from the source coding problem of Section 3 as

$$s^*(X_1,X_2|Z) = \max_{U - ZX_1 - X_2} \frac{I(X_2;U|Z)}{I(X_1;U|Z)} = \max_{z:\,p(z)>0} s^*(X_1,X_2|Z=z).$$

These two definitions coincide, as can be verified using their equivalence in the unconditional case. Moreover, they match the definition of $s^*_Z(X_1Z, X_2Z)$ given in [22]. In Appendix A we study the relation between conditional $s^*$ and conditional maximal correlation.

The conditional hypercontractivity ribbon is useful in studying tensorization for two-way channels, as recently shown by the authors in [5]. We briefly discuss this in Section 8. Also, an application of the conditional hypercontractivity ribbon to secure distribution simulation is given in Appendix B.

7 Computation of the regions and their local perturbation 

Explicit computation of the tensorizing regions defined so far for a given joint distribution can be computationally cumbersome, especially for distributions defined on large alphabets. This computation can be simplified if one observes that expressions with auxiliary random variables generally have alternative representations in terms of lower convex envelopes$^1$ (see e.g., [16]). Consider for instance

$$s^*(X,Y) = \sup_{U - X - Y} \frac{I(U;Y)}{I(U;X)}.$$

$^1$ A lower convex envelope of a function is the largest convex function that lies below the function.

A representation of this quantity in terms of lower convex envelopes is given in [17]. Indeed, $s^*(X,Y)$ can be written as the minimum value of $\lambda$ such that

$$H(Y) - \lambda H(X) \le \min_{U:\,U - X - Y} \big[H(Y|U) - \lambda H(X|U)\big]. \qquad (39)$$


The right-hand side of this equation has a representation in terms of the lower convex envelope operator as follows. Given $p(x,y) = p(x)p(y|x)$, we fix the channel $p(y|x)$ and vary the input distribution to define the following function

$$t_\lambda(q(x)) = H(Y) - \lambda H(X),$$

where entropies are computed with respect to $q(x,y) = q(x)p(y|x)$. Then

$$\min_{U:\,U - X - Y} \big[H(Y|U) - \lambda H(X|U)\big]$$

is the lower convex envelope of the function $t_\lambda(q(x))$ at $q(x) = p(x)$. Equation (39) then implies that $s^*(X,Y)$ is the minimum value of $\lambda$ such that the function $t_\lambda(q(x))$ touches its lower convex envelope at $p(x)$.

The lower convex envelope operator is still a global operator. In order to further simplify the computation, one can replace lower convex envelopes with the weaker constraint of local convexity, i.e., consider the minimum value of $\lambda$ such that the function $t_\lambda(q(x))$ is locally convex (has a positive semi-definite Hessian) at $p(x)$. This quantity is clearly a lower bound on $s^*(X,Y)$, and is shown in [17] to be equal to $\rho(X,Y)^2$, where $\rho(X,Y)$ is the maximal correlation between $X$ and $Y$. The quantity $\rho(X,Y)$ has an efficient representation in terms of principal inertia components (see [26, Sec. II.B] and references therein). As discussed in the introduction, it also satisfies the tensorization and data processing properties.
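The local-convexity threshold can be illustrated on a small example. The sketch below assumes a uniform binary input passed through a binary symmetric channel with crossover probability `alpha` (an illustrative choice, not taken from the paper), for which the maximal correlation is $1 - 2\alpha$. It estimates the second derivative of $t_\lambda$ at the uniform input by a finite difference and checks that its sign changes at $\lambda = \rho^2$.

```python
import math

def hb(r):
    """Binary entropy in bits."""
    if r <= 0.0 or r >= 1.0:
        return 0.0
    return -r * math.log2(r) - (1 - r) * math.log2(1 - r)

def t(lam, q, alpha):
    """t_lambda(q) = H(Y) - lam * H(X) for X ~ Bernoulli(q) through a BSC(alpha)."""
    r = q * (1 - alpha) + (1 - q) * alpha   # P(Y = 1)
    return hb(r) - lam * hb(q)

def second_difference(lam, q, alpha, h=1e-3):
    """Central finite-difference estimate of the second derivative of t in q."""
    return (t(lam, q + h, alpha) - 2 * t(lam, q, alpha) + t(lam, q - h, alpha)) / h**2

alpha = 0.1
rho_sq = (1 - 2 * alpha) ** 2     # squared maximal correlation of this source
above = second_difference(rho_sq + 0.05, 0.5, alpha)   # locally convex
below = second_difference(rho_sq - 0.05, 0.5, alpha)   # locally concave
print(above > 0, below < 0)
```

For larger alphabets the scalar second difference would have to be replaced by a check of the full Hessian over all zero-mean perturbation directions.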

More generally, in [5] the local approximation of the bipartite hypercontractivity ribbon is derived and the maximal correlation ribbon is defined. It is shown that this ribbon satisfies tensorization and data processing. One can apply this idea of local approximation to other regions defined in this paper. Here we do this for the region given in Section 4.

In Section 4 it was shown that

$$G^s_{X_{[k+1]}}(\lambda_{[k]}) = \max_{U - X_{k+1} - X_{[k]}} -I(X_{k+1};U) + \sum_{i=1}^k \lambda_i I(X_i;U)$$

is additive and satisfies the data processing inequality. This function can be written as

$$\begin{aligned}
G^s_{X_{[k+1]}}(\lambda_{[k]}) &= -H(X_{k+1}) + \sum_{i=1}^k \lambda_i H(X_i) + \max_{U - X_{k+1} - X_{[k]}} \Big[H(X_{k+1}|U) - \sum_{i=1}^k \lambda_i H(X_i|U)\Big] \\
&= -H(X_{k+1}) + \sum_{i=1}^k \lambda_i H(X_i) - \min_{U - X_{k+1} - X_{[k]}} \Big[-H(X_{k+1}|U) + \sum_{i=1}^k \lambda_i H(X_i|U)\Big].
\end{aligned}$$

This function is less than or equal to zero if and only if for any $p(u|x_{k+1})$ we have

$$-H(X_{k+1}|U) + \sum_{i=1}^k \lambda_i H(X_i|U) \ge -H(X_{k+1}) + \sum_{i=1}^k \lambda_i H(X_i). \qquad (40)$$

In other words, we have $\lambda_{[k]}\in\Upsilon^s(X_{[k+1]})$ if the function

$$t_{\lambda_{[k]}}(q(x_{k+1})) = -H(X_{k+1}) + \sum_{i=1}^k \lambda_i H(X_i), \qquad (41)$$

when we fix $p(x_{[k]}|x_{k+1})$ and vary the marginal distribution of $X_{k+1}$, lies on its lower convex envelope at $q(x_{k+1}) = p(x_{k+1})$.

Now, instead of being on the lower convex envelope, we look at the local convexity of $t_{\lambda_{[k]}}(\cdot)$ at $q(x_{k+1}) = p(x_{k+1})$. Local convexity is a necessary condition for being on the lower convex envelope. To verify local convexity, consider a local perturbation of the form $q_\epsilon(x_{k+1}) = p(x_{k+1})(1 + \epsilon f(x_{k+1}))$. Assuming that $\mathbb{E}[f(X_{k+1})] = 0$, this equation defines a valid distribution for sufficiently small $|\epsilon|$. Then we may consider the distribution $q_\epsilon(x_{[k+1]}) = q_\epsilon(x_{k+1})p(x_{[k]}|x_{k+1})$. The second derivative of (41) with respect to $\epsilon$ at $\epsilon = 0$ is equal to [18]

$$\frac{\partial^2}{\partial\epsilon^2} t_{\lambda_{[k]}}\big(q_\epsilon(x_{k+1})\big)\Big|_{\epsilon=0} = \mathbb{E}[f(X_{k+1})^2] - \sum_{i=1}^k \lambda_i \mathbb{E}\big[\mathbb{E}[f(X_{k+1})|X_i]^2\big].$$

We would like this to be non-negative for all valid perturbations $f$. Then we obtain the following new region.

Definition 10. Define

$$\begin{aligned}
\Lambda^s(X_{[k+1]}) &= \Big\{\lambda_{[k]}\in\mathbb{R}_+^k \,\Big|\, \mathbb{E}[f(X_{k+1})^2] \ge \sum_{i=1}^k \lambda_i \mathbb{E}\big[\mathbb{E}[f(X_{k+1})|X_i]^2\big],\ \forall f(X_{k+1}) : \mathbb{E}[f(X_{k+1})] = 0\Big\} \\
&= \Big\{\lambda_{[k]}\in\mathbb{R}_+^k \,\Big|\, \mathrm{Var}[f(X_{k+1})] \ge \sum_{i=1}^k \lambda_i \mathrm{Var}_{X_i}\big[\mathbb{E}_{X_{k+1}|X_i}[f(X_{k+1})]\big],\ \forall f(X_{k+1})\Big\}.
\end{aligned}$$

The region $\Lambda^s(X_{[k+1]})$ again satisfies data processing and tensorization. To prove this we consider the following function of $\lambda_{[k]}$:

$$\max_{f(X_{k+1})}\ -\mathrm{Var}[f(X_{k+1})] + \sum_{i=1}^k \lambda_i \mathrm{Var}_{X_i}\big[\mathbb{E}_{X_{k+1}|X_i}[f(X_{k+1})]\big]. \qquad (42)$$

Then the data processing and tensorization of $\Lambda^s(X_{[k+1]})$ is equivalent to the data processing and additivity of the function in (42).


Comparing equations (40) and (42), we see that the term $I(U;X_i)$ is replaced with

$$\mathrm{Var}_{X_i}\big[\mathbb{E}_{X_{k+1}|X_i}[f(X_{k+1})]\big].$$

This suggests that an algebraic proof of additivity and data processing of $G^s_{X_{[k+1]}}$ can be mimicked to obtain a proof of these properties for the function in (42). Indeed, using Table 1 we may transform any algebraic relation between quantities in terms of mutual information into a similar relation in terms of variance. In particular, the chain rule for mutual information corresponds to the law of total variance. The fourth property $I(U;C|DE) \ge I(U;C|D)$ holds for mutual information since $I(U;C|DE) = I(UE;C|D) \ge I(U;C|D)$, where the equality uses the Markov chain $C - D - E$. The proof of its analogue for variance is similar and can be found in [5, Lemma 30]. Using these properties, we show in Appendix C that a proof of additivity and data processing for $G^s_{X_{[k+1]}}$ gives a similar proof for the function in (42). For another proof of this type, see the proofs of the data processing and tensorization properties of the hypercontractivity ribbon and the maximal correlation ribbon in [5].

No. | Mutual information | Variance
1 | $I(U;B)$ with $U - A - B$ | $\mathrm{Var}_B[\mathbb{E}_{A|B}[f(A)]]$
2 | $I(U;C|B)$ with $U - A - BC$ | $\mathbb{E}_B\mathrm{Var}_{C|B}[\mathbb{E}_{A|BC}[f(A)]]$
3 | Chain rule: $I(U;BC) = I(U;B) + I(U;C|B)$ | Law of total variance: $\mathrm{Var}_{BC}[\mathbb{E}_{A|BC}[f(A)]] = \mathrm{Var}_B[\mathbb{E}_{A|B}[f(A)]] + \mathbb{E}_B\mathrm{Var}_{C|B}[\mathbb{E}_{A|BC}[f(A)]]$
4 | $I(U;C|DE) \ge I(U;C|D)$ if $C - D - E$, $U - A - CDE$ | $\mathbb{E}_{DE}\mathrm{Var}_{C|DE}\,\mathbb{E}_{A|CDE}[f(A)] \ge \mathbb{E}_D\mathrm{Var}_{C|D}\,\mathbb{E}_{A|CD}[f(A)]$ if $C - D - E$

Table 1: Algebraic similarities between mutual information and variance
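Row 3 of Table 1 can be verified numerically. The following sketch draws a random joint distribution of $(A,B,C)$ and a random function $f$ (both illustrative assumptions, as are the helper names `cond_mean` and `variance`) and checks the law of total variance for the conditional means of $f(A)$.

```python
import random
from collections import defaultdict

def cond_mean(p, f, keep):
    """E[f(A) | coordinates in `keep`] for a pmf p over triples (a, b, c).

    Returns ({key: conditional mean}, {key: probability of key}).
    """
    mass, acc = defaultdict(float), defaultdict(float)
    for abc, prob in p.items():
        key = tuple(abc[i] for i in keep)
        mass[key] += prob
        acc[key] += prob * f[abc[0]]
    return {k: acc[k] / mass[k] for k in mass}, dict(mass)

def variance(values, mass):
    """Variance of a random variable taking values[k] with probability mass[k]."""
    mean = sum(mass[k] * values[k] for k in mass)
    return sum(mass[k] * (values[k] - mean) ** 2 for k in mass)

rng = random.Random(1)
keys = [(a, b, c) for a in range(3) for b in range(2) for c in range(2)]
w = [rng.random() for _ in keys]
p = {k: x / sum(w) for k, x in zip(keys, w)}
f = [rng.uniform(-1, 1) for _ in range(3)]

m_bc, p_bc = cond_mean(p, f, keep=(1, 2))   # E[f(A) | B, C]
m_b, p_b = cond_mean(p, f, keep=(1,))       # E[f(A) | B]

lhs = variance(m_bc, p_bc)                  # Var_{BC}[E[f(A)|BC]]
# E_B Var_{C|B}[E[f(A)|BC]]: spread of E[f|B,C] around E[f|B], averaged over (B, C).
inner = sum(p_bc[(b, c)] * (m_bc[(b, c)] - m_b[(b,)]) ** 2 for (b, c) in p_bc)
rhs = variance(m_b, p_b) + inner
print(abs(lhs - rhs) < 1e-12)
```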


8 Two-way channels 


So far we have only considered source coding problems. We now consider a two-way channel coding problem. Let us begin by motivating our problem. Let $p(y|x)$ and $q(y|x)$ be two point-to-point channels. The question is whether we can simulate one use (copy) of the channel $q(y|x)$ from arbitrarily many uses of $p(y|x)$. In other words, given some arbitrarily small error $\epsilon > 0$, can we find some $n$ and a (possibly randomized) encoder $\mathcal{E}: \mathcal{X} \to \mathcal{X}^n$ and decoder $\mathcal{D}: \mathcal{Y}^n \to \mathcal{Y}$ such that the induced conditional distribution of $y$ given $x$ is within $\epsilon$ distance of $q(y|x)$ for every $x, y$? This question for point-to-point channels, as stated here, is easy to answer. Indeed, if the capacity of $q(y|x)$ is zero, then we only need local randomness to simulate it. Otherwise, simulation is feasible iff the capacity of $p(y|x)$ is non-zero. We observe that the answer to the simulation problem for point-to-point channels is easy since such channels with zero capacity have a trivial characterization.

Let us ask the same question for two-way channels: can we simulate a single copy of $q(y_1,y_2|x_1,x_2)$ from an arbitrary number of copies of $p(y_1,y_2|x_1,x_2)$? More precisely, are there $n$ and local encoding maps $\mathcal{E}_i: \mathcal{X}_i \to \mathcal{X}_i^n$, for $i = 1,2$, and decoding maps $\mathcal{D}_i: \mathcal{Y}_i^n \to \mathcal{Y}_i$ such that the induced conditional distribution of $(y_1,y_2)$ conditioned on $(x_1,x_2)$ is within $\epsilon$ distance of $q(y_1,y_2|x_1,x_2)$?

We may make this problem even more general by adding feedback to the channel. In this case the $i$-th encoder, $i = 1,2$, before using the $j$-th copy of $p(y_1,y_2|x_1,x_2)$, has access to the outputs of previous channels. More specifically, assume that there are two parties who have the channel $p(y_1,y_2|x_1,x_2)$ as a resource between them, which they can use arbitrarily many times. To begin with, the $i$-th party, $i = 1,2$, is given $x_i$, the input of the channel to be simulated. The $i$-th party creates input $X_{ij}$ at time instance $j$, using his past inputs and outputs of the channel, i.e., from $(x_i, X_{i[j-1]}, Y_{i[j-1]})$. After feeding $(X_{1j},X_{2j})$ to the $j$-th copy of $p(y_1,y_2|x_1,x_2)$, the output $(Y_{1j},Y_{2j})$ is generated. Finally, after using the two-way channel $p(y_1,y_2|x_1,x_2)$ for $n$ times, the $i$-th party creates $Y_i$ from $(x_i, X_{i[n]}, Y_{i[n]})$. We need the induced conditional distribution on $Y_1,Y_2$ to be close to $q(y_1,y_2|x_1,x_2)$.

To answer the possibility of channel simulation in the bipartite case as above, it is appropriate to restrict ourselves to zero-capacity channels (i.e., to channels whose capacity region is $\mathcal{C} = \{(0,0)\}$). The point is that (unlike the point-to-point case) there are non-trivial two-way channels with zero capacity.

Consider, for instance, the following class of zero-capacity channels with binary inputs and binary outputs (i.e., $y_1,y_2,x_1,x_2\in\{0,1\}$):

$$\mathrm{PR}_\eta(y_1,y_2|x_1,x_2) = \begin{cases} \frac{1+\eta}{4} & \text{if } y_1\oplus y_2 = x_1\wedge x_2, \\ \frac{1-\eta}{4} & \text{otherwise}, \end{cases} \qquad (43)$$

where $0 \le \eta \le 1$. Then the following statement is proved in our recent work [5].

Theorem 11 ([5]). For $1/2 \le \eta_1 < \eta_2 \le 1$, two parties cannot use an arbitrary number of copies of $\mathrm{PR}_{\eta_1}$ to generate a single copy of $\mathrm{PR}_{\eta_2}$.
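The channel in (43) is small enough to tabulate directly. The sketch below builds $\mathrm{PR}_\eta$ as a dictionary and checks that it is normalized for every input pair and that each party's output marginal is uniform regardless of the inputs, which is consistent with the channel having zero capacity. The function names and the value $\eta = 0.75$ are illustrative assumptions.

```python
from itertools import product

def pr_channel(eta):
    """PR_eta(y1, y2 | x1, x2) from (43), keyed by (y1, y2, x1, x2)."""
    return {
        (y1, y2, x1, x2): (1 + eta) / 4 if (y1 ^ y2) == (x1 & x2) else (1 - eta) / 4
        for y1, y2, x1, x2 in product((0, 1), repeat=4)
    }

def marginal_y1(ch, x1, x2):
    """Distribution of Y1 given the input pair (x1, x2)."""
    return [sum(ch[(y1, y2, x1, x2)] for y2 in (0, 1)) for y1 in (0, 1)]

def marginal_y2(ch, x1, x2):
    """Distribution of Y2 given the input pair (x1, x2)."""
    return [sum(ch[(y1, y2, x1, x2)] for y1 in (0, 1)) for y2 in (0, 1)]

ch = pr_channel(0.75)
for x1, x2 in product((0, 1), repeat=2):
    total = sum(ch[(y1, y2, x1, x2)] for y1, y2 in product((0, 1), repeat=2))
    print((x1, x2), total, marginal_y1(ch, x1, x2), marginal_y2(ch, x1, x2))
```

Because the marginal of $Y_1$ does not depend on $x_2$ (and that of $Y_2$ does not depend on $x_1$), neither party can signal to the other through a single use of the channel.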

Our goal here is to illustrate this result from the perspective of additivity and tensorization, 
based on the ideas we developed. 

Given $p(y_1,y_2)$, define

$$G^z_{\lambda_1,\lambda_2}(Y_1,Y_2) = \max_{p(u|y_1y_2)} -I(U;Y_1Y_2) + \lambda_1 I(U;Y_1) + \lambda_2 I(U;Y_2). \qquad (44)$$

Observe that this function is the one for the bipartite hypercontractivity ribbon and is a special case of (23). Therefore, it satisfies the data processing and additivity properties. Now, given a two-way channel $q(y_1,y_2|x_1,x_2)$, let

$$G^z_{\lambda_1,\lambda_2}(q(y_1,y_2|x_1,x_2)) = \max_{x_1,x_2} G^z_{\lambda_1,\lambda_2}(Y_1,Y_2|X_1=x_1,X_2=x_2).$$

Observe that $G^z_{\lambda_1,\lambda_2}(q(y_1,y_2|x_1,x_2))$ indeed corresponds to the conditional hypercontractivity ribbon of outputs given inputs, as in Lemma 9. The following lemma is the key step in proving Theorem 11.

Lemma 12. Assume that $(A,B)$ are sampled from some bipartite distribution $p(a,b)$. Suppose that we create $X_1$ as a function of $A$, and $X_2$ as a function of $B$. Then $(X_1,X_2)$ are put at the inputs of a two-way channel $p(y_1,y_2|x_1,x_2)$ which outputs $(Y_1,Y_2)$. Then for any $\lambda_1,\lambda_2 \ge 0$ we have

$$G^z_{\lambda_1,\lambda_2}(AY_1,BY_2) - \lambda_1 I(X_2;Y_1|X_1) - \lambda_2 I(X_1;Y_2|X_2) \le G^z_{\lambda_1,\lambda_2}(A,B) + G^z_{\lambda_1,\lambda_2}(p(y_1,y_2|x_1,x_2)).$$

Assuming this lemma, the following theorem gives a method for proving the impossibility of channel simulation.

Theorem 13. For any two-way channel $p(y_1,y_2|x_1,x_2)$ let

$$\Upsilon^z(p(y_1,y_2|x_1,x_2)) = \{(\lambda_1,\lambda_2)\in[0,1]^2 \mid G^z_{\lambda_1,\lambda_2}(p(y_1,y_2|x_1,x_2)) \le 0\}.$$

Assume that $p(y_1,y_2|x_1,x_2)$ has zero capacity. Then simulation of $q(y_1,y_2|x_1,x_2)$ with $p(y_1,y_2|x_1,x_2)$, as defined above, is possible only if $\Upsilon^z(p(y_1,y_2|x_1,x_2)) \subseteq \Upsilon^z(q(y_1,y_2|x_1,x_2))$.


Proof. Let $A$ and $B$ respectively denote all information available to the two parties (including their private randomness) before using the two-way channel $p(y_1,y_2|x_1,x_2)$ at some time step. Then their available information after using the channel is $AY_1$ and $BY_2$. When the channel has zero capacity, we have $I(X_2;Y_1|X_1=x_1) = 0$ for every value of $x_1$ [14, Proposition 17.2]; similarly, we have $I(X_1;Y_2|X_2=x_2) = 0$ for every value of $x_2$. Thus, $I(X_2;Y_1|X_1) = I(X_1;Y_2|X_2) = 0$ for any $p(x_1,x_2)$. Then by Lemma 12 we have

$$G^z_{\lambda_1,\lambda_2}(AY_1,BY_2) \le G^z_{\lambda_1,\lambda_2}(A,B) + G^z_{\lambda_1,\lambda_2}(p(y_1,y_2|x_1,x_2)).$$

This means that, if $G^z_{\lambda_1,\lambda_2}(A,B) \le 0$ and $G^z_{\lambda_1,\lambda_2}(p(y_1,y_2|x_1,x_2)) \le 0$, then $G^z_{\lambda_1,\lambda_2}(AY_1,BY_2) \le 0$.

Now consider a simulation code with error $\epsilon$. The initial information state is $(T'_1,T'_2) = (x_1T_1, x_2T_2)$, where $x_1$ and $x_2$ are two constants (the inputs of the channel we want to simulate), and $T_1$ and $T_2$ are two mutually independent private sources of randomness. Since $T'_1$ and $T'_2$ are independent, for any $(\lambda_1,\lambda_2)\in[0,1]^2$ we have $G^z_{\lambda_1,\lambda_2}(T'_1,T'_2) = 0$. Therefore, if $(\lambda_1,\lambda_2)$ is such that $G^z_{\lambda_1,\lambda_2}(p(y_1,y_2|x_1,x_2)) \le 0$, by repeating the above argument we find that

$$G^z_{\lambda_1,\lambda_2}(x_1X_{1[n]}Y_{1[n]},\ x_2X_{2[n]}Y_{2[n]}) \le 0$$

at the final stage of communication. From the data processing property of $G^z_{\lambda_1,\lambda_2}$, we find that $G^z_{\lambda_1,\lambda_2}(Y_1,Y_2) \le 0$. Thus, for any arbitrary $p(u|y_1y_2)$, we have

$$-I(U;Y_1Y_2) + \lambda_1 I(U;Y_1) + \lambda_2 I(U;Y_2) \le 0.$$

Now, letting $\epsilon$ converge to zero and using the continuity of mutual information in the underlying distribution, we get that $(\lambda_1,\lambda_2)$ belongs to $\Upsilon^z(q(y_1,y_2|x_1,x_2))$. This gives the desired result. □


We now give a proof of Lemma 12.

Proof of Lemma 12. Take some $p(u,a,b,x_1,x_2,y_1,y_2)$ that achieves the maximum in $G^z_{\lambda_1,\lambda_2}(AY_1,BY_2)$. Then we have

$$\begin{aligned}
I(U;Y_1Y_2AB) &= I(U;AB) + I(U;Y_1Y_2|AB) \\
&= I(U;AB) + I(U;Y_1Y_2|ABX_1X_2) && (45)\\
&= I(U;AB) + I(UAB;Y_1Y_2|X_1X_2) \\
&= I(U;AB) - \lambda_1 I(U;A) - \lambda_2 I(U;B) \\
&\quad + I(UAB;Y_1Y_2|X_1X_2) - \lambda_1 I(UAB;Y_1|X_1X_2) - \lambda_2 I(UAB;Y_2|X_1X_2) \\
&\quad + \lambda_1 I(U;A) + \lambda_2 I(U;B) + \lambda_1 I(UAB;Y_1|X_1X_2) + \lambda_2 I(UAB;Y_2|X_1X_2) \\
&\ge -G^z_{\lambda_1,\lambda_2}(A,B) - G^z_{\lambda_1,\lambda_2}(p(y_1,y_2|x_1,x_2)) \\
&\quad + \lambda_1 I(U;A) + \lambda_2 I(U;B) + \lambda_1 I(UAB;Y_1|X_1X_2) + \lambda_2 I(UAB;Y_2|X_1X_2),
\end{aligned}$$

where equation (45) follows from the fact that $X_1$ and $X_2$ are functions of $A$ and $B$ respectively. Since $p(u,a,b,x_1,x_2,y_1,y_2)$ achieves the maximum in $G^z_{\lambda_1,\lambda_2}(AY_1,BY_2)$, we have

$$-G^z_{\lambda_1,\lambda_2}(AY_1,BY_2) = I(U;Y_1Y_2AB) - \lambda_1 I(U;Y_1A) - \lambda_2 I(U;Y_2B).$$

Hence,

$$\begin{aligned}
-G^z_{\lambda_1,\lambda_2}(AY_1,BY_2) &\ge -G^z_{\lambda_1,\lambda_2}(A,B) - G^z_{\lambda_1,\lambda_2}(p(y_1,y_2|x_1,x_2)) \\
&\quad + \lambda_1 I(U;A) + \lambda_2 I(U;B) + \lambda_1 I(UAB;Y_1|X_1X_2) + \lambda_2 I(UAB;Y_2|X_1X_2) \\
&\quad - \lambda_1 I(U;Y_1A) - \lambda_2 I(U;Y_2B) \\
&= -G^z_{\lambda_1,\lambda_2}(A,B) - G^z_{\lambda_1,\lambda_2}(p(y_1,y_2|x_1,x_2)) \\
&\quad + \lambda_1\big[I(UAB;Y_1|X_1X_2) - I(U;Y_1|A)\big] + \lambda_2\big[I(UAB;Y_2|X_1X_2) - I(U;Y_2|B)\big] \\
&= -G^z_{\lambda_1,\lambda_2}(A,B) - G^z_{\lambda_1,\lambda_2}(p(y_1,y_2|x_1,x_2)) \\
&\quad + \lambda_1\big[I(UAB;Y_1|X_1X_2) - I(U;Y_1|AX_1)\big] + \lambda_2\big[I(UAB;Y_2|X_1X_2) - I(U;Y_2|BX_2)\big] \\
&= -G^z_{\lambda_1,\lambda_2}(A,B) - G^z_{\lambda_1,\lambda_2}(p(y_1,y_2|x_1,x_2)) \\
&\quad + \lambda_1\big[I(UABX_2;Y_1|X_1) - I(U;Y_1|AX_1)\big] + \lambda_2\big[I(UABX_1;Y_2|X_2) - I(U;Y_2|BX_2)\big] \\
&\quad - \lambda_1 I(X_2;Y_1|X_1) - \lambda_2 I(X_1;Y_2|X_2) \\
&\ge -G^z_{\lambda_1,\lambda_2}(A,B) - G^z_{\lambda_1,\lambda_2}(p(y_1,y_2|x_1,x_2)) - \lambda_1 I(X_2;Y_1|X_1) - \lambda_2 I(X_1;Y_2|X_2).
\end{aligned}$$

□

9 Conclusion and Future Work


In this paper we defined new classes of measures of correlation that satisfy the tensorization property. These measures were defined using additive functions, which themselves are useful for non-interactive distribution simulation with a non-zero rate. Conditional versions of the proposed measures are derived, and are shown to be applicable to the secure distribution simulation problem. Since explicit computation of the proposed regions is generally difficult, we looked at local perturbation of the regions. Tensorization and data processing of the local regions can be shown via an analogy between properties of mutual information and variance. In the appendices, we study different characterizations of the multi-partite HC ribbon. We also define a new multi-partite maximal correlation.

All the source coding problems that we considered have a capacity region characterized by a 
single auxiliary random variable. It would be interesting to consider problems with more than one 
auxiliary random variable. Except for the section on two-way channels, our main emphasis was on 
the source coding problems. It would be interesting to explore tensorizing measures for channels. 

The multi-partite HC ribbon has a description in terms of Schatten norms. For this reason, it 
has found applications in other areas of mathematics. It would be interesting to see whether other 
regions defined in this paper have similar characterizations. 

Finally, we defined a notion of multi-partite maximal correlation. It would be interesting to see if this measure is related to the maximal correlation ribbon (MC ribbon). The MC ribbon is the local perturbation of the HC ribbon, and can be derived by setting $X_{k+1} = X_{[k]}$ in Definition 10.


Appendix


A Conditional $\rho$ and $s^*$


We need the following definition:

Definition 14 (Conditional Maximal Correlation, [27]). For a tripartite distribution $p(x,y,z)$, the conditional maximal correlation $\rho(X,Y|Z)$ is defined as

$$\rho(X,Y|Z) = \max_{z:\,p(z)>0} \rho(X,Y|Z=z).$$
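Definition 14 translates directly into a computation. The sketch below uses the standard fact that $\rho(X,Y)$ is the second-largest singular value of the matrix with entries $p(x,y)/\sqrt{p(x)p(y)}$, and estimates it with a plain power iteration after removing the trivial singular pair; the conditional version then takes the maximum over $z$. The function names, the input layout `pzxy[z][x][y]`, the iteration count, and the example distribution are assumptions made for illustration.

```python
import math

def maximal_correlation(p, iters=200):
    """Second singular value of B[x][y] = p(x,y)/sqrt(p(x)p(y)), via power iteration."""
    xs = [x for x in range(len(p)) if sum(p[x]) > 0]
    ys = [y for y in range(len(p[0])) if sum(row[y] for row in p) > 0]
    px = {x: sum(p[x]) for x in xs}
    py = {y: sum(row[y] for row in p) for y in ys}
    # Subtract the trivial singular pair (sqrt(px), sqrt(py)) with singular value 1.
    b = [[p[x][y] / math.sqrt(px[x] * py[y]) - math.sqrt(px[x] * py[y]) for y in ys]
         for x in xs]
    # Arbitrary start vector; it must not be orthogonal to the top singular vector.
    v = [float(j + 1) for j in range(len(ys))]
    sigma = 0.0
    for _ in range(iters):
        u = [sum(bx[j] * v[j] for j in range(len(ys))) for bx in b]               # B v
        w = [sum(b[i][j] * u[i] for i in range(len(xs))) for j in range(len(ys))]  # B^T B v
        norm_v = math.sqrt(sum(t * t for t in v))
        norm_w = math.sqrt(sum(t * t for t in w))
        if norm_w < 1e-15:
            return 0.0
        sigma = math.sqrt(norm_w / norm_v)
        v = [t / norm_w for t in w]
    return sigma

def conditional_maximal_correlation(pzxy):
    """max over z with p(z) > 0 of rho(X, Y | Z = z); pzxy[z][x][y] is a joint pmf."""
    best = 0.0
    for slab in pzxy:
        pz = sum(map(sum, slab))
        if pz > 0:
            cond = [[val / pz for val in row] for row in slab]
            best = max(best, maximal_correlation(cond))
    return best

# Z = 0: doubly symmetric binary source with crossover 0.1; Z = 1: independent bits.
a = 0.1
pzxy = [
    [[0.5 * (1 - a) / 2, 0.5 * a / 2], [0.5 * a / 2, 0.5 * (1 - a) / 2]],
    [[0.125, 0.125], [0.125, 0.125]],
]
rho_cond = conditional_maximal_correlation(pzxy)
print(rho_cond)
```

In this example the correlated slab has maximal correlation $1 - 2a$ and the independent slab has maximal correlation zero, so the maximum is attained at $z = 0$.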


Lemma 15 ([27]). We have

$$\rho^2(X,Y|Z) = \max_{\mathbb{E}[f|Z]=0,\ \mathbb{E}[f^2]=1} \mathbb{E}_{YZ}\big[(\mathbb{E}_{X|YZ}[f(X,Z)])^2\big],$$

where the maximum is taken over all functions $f: \mathcal{X}\times\mathcal{Z}\to\mathbb{R}$.


Proof. Here we briefly explain the idea of the proof. We first note that

$$\begin{aligned}
\rho(X,Y|Z) = \max\ & \mathbb{E}[f(X,Z)g(Y,Z)] \\
\text{subject to}\ & \mathbb{E}_{X|Z}[f] = \mathbb{E}_{Y|Z}[g] = 0, \\
& \mathbb{E}[f^2] = \mathbb{E}[g^2] = 1.
\end{aligned}$$

To verify this, it suffices to expand the expectations $\mathbb{E}[\cdot]$ as $\mathbb{E}_Z[\mathbb{E}_{XY|Z}[\cdot]]$, and instead of functions $f(X,Z), g(Y,Z)$ to consider pairs of functions $(f(X,z), g(Y,z))$ for all $z$.

Now, having the above characterization of conditional maximal correlation, we can prove the lemma. The point is that if we fix $f(x,z)$, by the Cauchy-Schwarz inequality, the optimal $g(Y,Z)$ will be proportional to $\mathbb{E}_{X|YZ}[f(X,Z)]$. □


From this definition we have $\rho(X,Y|Z)^2 \le s^*(X,Y|Z)$ since $\rho(X,Y|Z=z)^2 \le s^*(X,Y|Z=z)$ for every $z$ with $p(z) > 0$. Before stating another connection between $s^*$ and $\rho$, we need the following alternative characterization of $s^*(X,Y|Z)$.

Lemma 16. We have

$$s^*(X,Y|Z) = \sup_{U:\ U - XZ - Y,\ I(U;Z)=0} \frac{I(U;Y|Z)}{I(U;X|Z)}.$$

In other words, in the definition of conditional $s^*$, the supremum with or without the constraint $I(U;Z) = 0$ gives rise to the same value.

Proof. Take some $p(u|x,z)$, so that $U - XZ - Y$ forms a Markov chain. By the functional representation lemma [14, Appendix B] applied to $p(u|z)$, one can find $p(u,u',z)$ where $U'$ is independent of $Z$ and $H(U|U'Z) = 0$. Next, define the joint distribution

$$p(u',u,x,y,z) = p(u',z)\,p(u|u',z)\,p(x|u,z)\,p(y|x,z),$$

whose marginal distribution on $(U,Z,X,Y)$ is the one we started with. Observe that we have Markov chains $U' - XZ - Y$ and $U' - UZ - XY$. Then $I(U;Y|Z) = I(U';Y|Z)$ and $I(U;X|Z) = I(U';X|Z)$. Hence,

$$\frac{I(U;Y|Z)}{I(U;X|Z)} = \frac{I(U';Y|Z)}{I(U';X|Z)},$$

and we have $I(U';Z) = 0$. □

We are now ready to provide an alternative characterization of conditional $\rho$ and $s^*$ in terms of lower convex envelopes. This generalizes such a characterization of [17] to the conditional case.

Fix $p(z)$ and a channel $p(y|xz)$. Then for $\lambda\in[0,1]$ define the following function of $p(x|z)$:

$$t_\lambda(p(x|z)) = H(Y|Z) - \lambda H(X|Z).$$

Theorem 17. The following statements hold:

(i) $\rho^2(X,Y|Z)$ is the minimum value of $\lambda$ such that the function $t_\lambda$ has a positive semidefinite Hessian at $p(x|z)$.

(ii) $s^*(X,Y|Z)$ is the minimum value of $\lambda$ such that the function $t_\lambda$ touches its lower convex envelope at $p(x|z)$.

Proof. (i) This follows from the following characterization of conditional maximal correlation (Lemma 15):

$$\rho^2(X,Y|Z) = \max_{\mathbb{E}[f|Z]=0,\ \mathbb{E}[f^2]=1} \mathbb{E}_{YZ}\big[(\mathbb{E}_{X|YZ}[f(X,Z)])^2\big].$$

Take an arbitrary perturbation of the form $p_\epsilon(x,z) = p(x,z)(1 + \epsilon f(x,z))$ such that $p_\epsilon(z) = p(z)$. For $p_\epsilon$ to stay a valid perturbation we need $\mathbb{E}[f] = 0$, and for it to satisfy $p_\epsilon(z) = p(z)$, we need $\mathbb{E}[f|Z] = 0$. Furthermore, we can normalize $f$ by assuming that $\mathbb{E}[f^2] = 1$. With these constraints we obtain a conditional distribution $p_\epsilon(x|z)$ for sufficiently small $|\epsilon|$. Then we have

$$\begin{aligned}
\frac{\partial^2}{\partial\epsilon^2} t_\lambda(p_\epsilon(x|z))\Big|_{\epsilon=0} &= -\mathbb{E}\big[\mathbb{E}[f(X,Z)|YZ]^2\big] + \lambda\,\mathbb{E}[f^2(X,Z)] \\
&= -\mathbb{E}\big[\mathbb{E}[f(X,Z)|YZ]^2\big] + \lambda,
\end{aligned}$$

which is non-negative as long as $\lambda \ge \mathbb{E}\big[\mathbb{E}[f(X,Z)|YZ]^2\big]$. Thus the minimum value $\lambda^*$ such that the second derivative is non-negative for all local perturbations is

$$\lambda^* = \max_{\mathbb{E}[f|Z]=0,\ \mathbb{E}[f^2]=1} \mathbb{E}_{YZ}\big[(\mathbb{E}_{X|YZ}f(X,Z))^2\big].$$

(ii) Consider the minimum value of $\lambda$, say $\bar{\lambda}$, such that the function $t_\lambda$ touches its lower convex envelope at $p(x|z)$. This means that $\bar{\lambda}$ is the minimum $\lambda$ such that

$$H(Y|Z) - \lambda H(X|Z) \le H(Y|UZ) - \lambda H(X|UZ), \quad \forall U: U - XZ - Y,\ I(U;Z) = 0.$$

Note that if $U$ is conditionally independent of $X$, i.e., $I(U;X|Z) = 0$, then the above inequality always holds. So let us further assume that $I(U;X|Z) > 0$. Then, rewriting the above equation, we find that $\bar{\lambda}$ is the minimum $\lambda$ such that

$$\lambda \ge \frac{I(U;Y|Z)}{I(U;X|Z)}, \quad \forall U: U - XZ - Y \text{ with } I(U;Z) = 0,\ I(U;X|Z) > 0.$$

Thus,

$$\bar{\lambda} = \sup_{U:\ U - XZ - Y,\ I(U;Z)=0} \frac{I(U;Y|Z)}{I(U;X|Z)},$$

which equals $s^*(X,Y|Z)$ by Lemma 16. □


B Secure distribution simulation: an application of conditional 
hypercontractivity ribbon 

Consider two parties and an adversary who observe i.i.d. repetitions of $X_1$, $X_2$ and $Z$ respectively. The goal of the parties is to securely generate a single copy of $(Y_1,Y_2)$ with a given distribution $q(y_1,y_2)$ under local stochastic maps. More precisely, we say that secure non-interactive simulation of $(Y_1,Y_2)$ from i.i.d. repetitions of $(X_1,X_2,Z)$ is possible if for every $\epsilon > 0$ there is $n$ such that the parties can generate a single copy of $Y_1$ and $Y_2$ as stochastic functions of $X_1^n$ and $X_2^n$ respectively such that

• Reliability constraint: $(Y_1,Y_2)$ has the desired joint distribution $q(y_1,y_2)$, i.e., the joint distribution of the simulated random variables $p(y_1,y_2)$ is $\epsilon$-close to $q(y_1,y_2)$:

$$\|p(y_1,y_2) - q(y_1,y_2)\|_1 \le \epsilon.$$

• Security: $(Y_1,Y_2)$ is almost independent of $Z^n$:

$$I(Y_1Y_2;Z^n) \le \epsilon.$$

The following theorem gives a bound on the problem of secure distribution simulation based on the conditional hypercontractivity ribbon.

Theorem 18. If secure distribution simulation is possible, then we have

$$\mathfrak{R}(X_1,X_2|Z) \subseteq \mathfrak{R}(Y_1,Y_2).$$

Proof. We have

$$\begin{aligned}
\mathfrak{R}(X_1,X_2|Z) &= \mathfrak{R}(X_1^n,X_2^n|Z^n) \\
&\subseteq \mathfrak{R}(Y_1,Y_2|Z^n) \\
&= \bigcap_{z^n:\,p(z^n)>0} \mathfrak{R}(Y_1,Y_2|z^n), && (46)
\end{aligned}$$

where the first equation follows from the tensorization of the conditional hypercontractivity ribbon, the inclusion follows from its data processing property, and the last equation follows from Lemma 9.

Observe that

$$I(Y_1Y_2;Z^n) = \sum_{z^n} p(z^n)\,D\big(p(y_1y_2|z^n)\,\|\,p(y_1y_2)\big) \ge 2\sum_{z^n} p(z^n)\,\|p(y_1y_2|z^n) - p(y_1y_2)\|^2,$$

where we use Pinsker's inequality. Assuming that the left-hand side is at most $\epsilon$, there is some $z_0^n$ such that $p(z_0^n) > 0$ and

$$\|p(y_1y_2|z_0^n) - p(y_1y_2)\| \le \sqrt{\epsilon/2}.$$

Now using (46), we have $\mathfrak{R}(X_1,X_2|Z) \subseteq \mathfrak{R}(Y_1,Y_2|z_0^n)$. This means that, if $(\lambda_1,\lambda_2)\in\mathfrak{R}(X_1,X_2|Z)$, then $(\lambda_1,\lambda_2)\in\mathfrak{R}(Y_1,Y_2|z_0^n)$, i.e., for any arbitrary $p(u|y_1y_2)$:

$$\lambda_1 I(U;Y_1|Z^n=z_0^n) + \lambda_2 I(U;Y_2|Z^n=z_0^n) \le I(U;Y_1Y_2|Z^n=z_0^n). \qquad (47)$$

On the other hand, by the triangle inequality, $\|p(y_1y_2|z_0^n) - q(y_1y_2)\|_1 \le \epsilon + \sqrt{\epsilon/2}$. Then we may use the Fannes inequality to approximate each term of (47) by an unconditional mutual information. Indeed, as $\epsilon\to 0$ we obtain

$$\lambda_1 I(U;Y_1) + \lambda_2 I(U;Y_2) \le I(U;Y_1Y_2).$$

Thus, $(\lambda_1,\lambda_2)\in\mathfrak{R}(Y_1,Y_2)$. □

Additivity and data processing of $G^{\hat{s}}_{X_{[k+1]}}(\lambda_{[k]})$


Our goal in this appendix is to prove the additivity and data processing properties of the variance-based function $G^{\hat{s}}_{X_{[k+1]}}(\lambda_{[k]})$ defined in (42). For this we first give an algebraic proof of these properties for $G^{s}_{X_{[k+1]}}(\lambda_{[k]})$ defined in (40), and then, using the recipe of Table 7, we convert it to a proof for $G^{\hat{s}}_{X_{[k+1]}}(\lambda_{[k]})$.

C.1 Additivity

We start by showing that $G^{s}_{X_{[k+1]}}(\lambda_{[k]})$ is additive. That is, if $X_{[k+1]}$ and $Y_{[k+1]}$ are independent (but not necessarily identically distributed), then $G^{s}_{X_{[k+1]} Y_{[k+1]}}(\lambda_{[k]}) = G^{s}_{X_{[k+1]}}(\lambda_{[k]}) + G^{s}_{Y_{[k+1]}}(\lambda_{[k]})$. From the definition

$$G^{s}_{X_{[k+1]} Y_{[k+1]}}(\lambda_{[k]}) = \max_{U - X_{k+1} Y_{k+1} - X_{[k]} Y_{[k]}} \; -I(X_{k+1} Y_{k+1}; U) + \sum_{i=1}^{k} \lambda_i I(X_i Y_i; U), \qquad (48)$$

it is clear that $G^{s}_{X_{[k+1]} Y_{[k+1]}}(\lambda_{[k]}) \ge G^{s}_{X_{[k+1]}}(\lambda_{[k]}) + G^{s}_{Y_{[k+1]}}(\lambda_{[k]})$, since we can take $U$ to consist of an independent pair $(U_1, U_2)$ with $U_1 - X_{k+1} - X_{[k]}$ and $U_2 - Y_{k+1} - Y_{[k]}$.
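The product construction can be sanity-checked numerically: for $U = (U_1, U_2)$ built from independent sources, the objective in (48) splits exactly into the two single-source objectives. The sketch below assumes $k = 1$ (so $X_2$ and $Y_2$ play the roles of $X_{k+1}$ and $Y_{k+1}$), binary alphabets, and hypothetical toy distributions, channels and helper names.

```python
import math

def marginal(joint, idx):
    """Marginal pmf of the coordinates listed in idx."""
    out = {}
    for outcome, pr in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + pr
    return out

def mutual_information(joint, idx_a, idx_b):
    """I(A;B) in bits, where A and B are groups of coordinates of the joint pmf."""
    pa, pb = marginal(joint, idx_a), marginal(joint, idx_b)
    pab = marginal(joint, idx_a + idx_b)
    total = 0.0
    for key, pr in pab.items():
        if pr > 0:
            a, b = key[:len(idx_a)], key[len(idx_a):]
            total += pr * math.log2(pr / (pa[a] * pb[b]))
    return total

# Hypothetical toy data: independent sources (X1, X2) and (Y1, Y2).
p_x = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_y = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.3}
chan_u1 = [[0.9, 0.1], [0.2, 0.8]]  # p(u1 | x2)
chan_u2 = [[0.7, 0.3], [0.3, 0.7]]  # p(u2 | y2)
lam = 0.7

# Coordinates: 0 = x1, 1 = x2, 2 = y1, 3 = y2, 4 = u1, 5 = u2.
joint = {}
for (x1, x2), px in p_x.items():
    for (y1, y2), py in p_y.items():
        for u1 in (0, 1):
            for u2 in (0, 1):
                joint[(x1, x2, y1, y2, u1, u2)] = (
                    px * py * chan_u1[x2][u1] * chan_u2[y2][u2])

obj_xy = (-mutual_information(joint, [1, 3], [4, 5])
          + lam * mutual_information(joint, [0, 2], [4, 5]))
obj_x = -mutual_information(joint, [1], [4]) + lam * mutual_information(joint, [0], [4])
obj_y = -mutual_information(joint, [3], [5]) + lam * mutual_information(joint, [2], [5])
print(abs(obj_xy - (obj_x + obj_y)) < 1e-9)
```

The check only covers the "at least" direction for one particular choice of channels; it does not compute the maximum in (48).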

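To show the other direction, note that

$$\begin{aligned}
&-I(X_{k+1} Y_{k+1}; U) + \sum_{i=1}^{k} \lambda_i I(X_i Y_i; U) && (49) \\
&\quad = -I(X_{k+1}; U) + \sum_{i=1}^{k} \lambda_i I(X_i; U) - I(Y_{k+1}; U | X_{k+1}) + \sum_{i=1}^{k} \lambda_i I(Y_i; U | X_i) \\
&\quad \le G^{s}_{X_{[k+1]}}(\lambda_{[k]}) - I(Y_{k+1}; U | X_{k+1}) + \sum_{i=1}^{k} \lambda_i I(Y_i; U X_{k+1} | X_i) \\
&\quad = G^{s}_{X_{[k+1]}}(\lambda_{[k]}) - I(Y_{k+1}; U X_{k+1}) + \sum_{i=1}^{k} \lambda_i I(Y_i; U X_{k+1} X_i) && (50) \\
&\quad = G^{s}_{X_{[k+1]}}(\lambda_{[k]}) - I(Y_{k+1}; U X_{k+1}) + \sum_{i=1}^{k} \lambda_i I(Y_i; U X_{k+1}) && (51) \\
&\quad \le G^{s}_{X_{[k+1]}}(\lambda_{[k]}) + G^{s}_{Y_{[k+1]}}(\lambda_{[k]}), && (52)
\end{aligned}$$

where in (50) we used the fact that $X_{[k+1]}$ and $Y_{[k+1]}$ are independent; in (51) we used $I(Y_i; X_i | U X_{k+1}) = 0$, which holds because

$$\begin{aligned}
I(Y_i; X_i | U X_{k+1}) &\le I(Y_i Y_{k+1}; X_i | U X_{k+1}) \\
&= I(Y_i; X_i | U X_{k+1} Y_{k+1}) + I(Y_{k+1}; X_i | U X_{k+1}) \\
&\le 0 + I(U Y_{k+1}; X_i | X_{k+1}) \\
&= I(Y_{k+1}; X_i | X_{k+1}) + I(U; X_i | X_{k+1} Y_{k+1}) \\
&= 0,
\end{aligned}$$

and finally in (52) we used the Markov chain condition $U X_{k+1} - Y_{k+1} - Y_{[k]}$.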

To show that $G^{\hat{s}}_{X_{[k+1]}}(\lambda_{[k]})$ is additive, we follow similar steps. We need to show that if $X_{[k+1]}$ and $Y_{[k+1]}$ are independent (but not necessarily identically distributed), then

$$G^{\hat{s}}_{X_{[k+1]} Y_{[k+1]}}(\lambda_{[k]}) = G^{\hat{s}}_{X_{[k+1]}}(\lambda_{[k]}) + G^{\hat{s}}_{Y_{[k+1]}}(\lambda_{[k]}).$$

From the definition

$$G^{\hat{s}}_{X_{[k+1]} Y_{[k+1]}}(\lambda_{[k]}) = \max_{f(X_{k+1} Y_{k+1})} \; -\mathrm{Var}[f(X_{k+1} Y_{k+1})] + \sum_{i=1}^{k} \lambda_i \mathrm{Var}_{X_i Y_i}\big[\mathbb{E}_{X_{k+1} Y_{k+1} | X_i Y_i}[f(X_{k+1} Y_{k+1})]\big], \qquad (53)$$

it is clear that $G^{\hat{s}}_{X_{[k+1]} Y_{[k+1]}}(\lambda_{[k]}) \ge G^{\hat{s}}_{X_{[k+1]}}(\lambda_{[k]}) + G^{\hat{s}}_{Y_{[k+1]}}(\lambda_{[k]})$, since we can take $f(X_{k+1}, Y_{k+1})$ to consist of a pair $(f(X_{k+1}), f(Y_{k+1}))$.
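For a real-valued $f$, one concrete way to realize this construction is the sum $f(x, y) = f_1(x) + f_2(y)$, for which the objective in (53) splits exactly. The following sketch checks this numerically. It assumes $k = 1$, binary alphabets and hypothetical toy distributions and function tables; reading the pair as a sum is an interpretation made here for illustration.

```python
def variance(joint, g):
    """Var[g] under the joint pmf {outcome: probability}."""
    mean = sum(pr * g(o) for o, pr in joint.items())
    return sum(pr * (g(o) - mean) ** 2 for o, pr in joint.items())

def var_of_cond_exp(joint, g, idx):
    """Variance, over the coordinates in idx, of E[g | those coordinates]."""
    mass, weighted = {}, {}
    for o, pr in joint.items():
        key = tuple(o[i] for i in idx)
        mass[key] = mass.get(key, 0.0) + pr
        weighted[key] = weighted.get(key, 0.0) + pr * g(o)
    mean = sum(weighted.values())
    return sum(m * (weighted[k] / m - mean) ** 2 for k, m in mass.items() if m > 0)

# Hypothetical toy data: independent sources (X1, X2) and (Y1, Y2).
p_x = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_y = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.3}
lam = 0.7

# Coordinates: 0 = x1, 1 = x2, 2 = y1, 3 = y2.
joint = {(x1, x2, y1, y2): px * py
         for (x1, x2), px in p_x.items()
         for (y1, y2), py in p_y.items()}

f1 = {0: 1.0, 1: -2.0}   # arbitrary function of X2
f2 = {0: 0.5, 1: 3.0}    # arbitrary function of Y2

def f_x(o):
    return f1[o[1]]

def f_y(o):
    return f2[o[3]]

def f_sum(o):
    return f_x(o) + f_y(o)

obj_xy = -variance(joint, f_sum) + lam * var_of_cond_exp(joint, f_sum, [0, 2])
obj_x = -variance(joint, f_x) + lam * var_of_cond_exp(joint, f_x, [0])
obj_y = -variance(joint, f_y) + lam * var_of_cond_exp(joint, f_y, [2])
print(abs(obj_xy - (obj_x + obj_y)) < 1e-9)
```

Both the variance and the variance of the conditional expectation add up because the two summands depend on independent groups of variables.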

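To show the other direction, note that

$$\begin{aligned}
&-\mathrm{Var}[f(X_{k+1} Y_{k+1})] + \sum_{i=1}^{k} \lambda_i \mathrm{Var}_{X_i Y_i}\big[\mathbb{E}_{X_{k+1} Y_{k+1} | X_i Y_i}[f(X_{k+1} Y_{k+1})]\big] \\
&\quad = -\mathrm{Var}_{X_{k+1}} \mathbb{E}_{Y_{k+1} | X_{k+1}}[f(X_{k+1} Y_{k+1})] + \sum_{i=1}^{k} \lambda_i \mathrm{Var}_{X_i}\big[\mathbb{E}_{X_{k+1} Y_{k+1} | X_i}[f(X_{k+1} Y_{k+1})]\big] \\
&\qquad - \mathbb{E}_{X_{k+1}} \mathrm{Var}_{Y_{k+1} | X_{k+1}}[f(X_{k+1} Y_{k+1})] + \sum_{i=1}^{k} \lambda_i \mathbb{E}_{X_i} \mathrm{Var}_{Y_i | X_i}\big[\mathbb{E}_{X_{k+1} Y_{k+1} | X_i Y_i}[f(X_{k+1} Y_{k+1})]\big] \\
&\quad \le G^{\hat{s}}_{X_{[k+1]}}(\lambda_{[k]}) - \mathbb{E}_{X_{k+1}} \mathrm{Var}_{Y_{k+1} | X_{k+1}}[f(X_{k+1} Y_{k+1})] \\
&\qquad + \sum_{i=1}^{k} \lambda_i \mathbb{E}_{X_i} \mathrm{Var}_{Y_i | X_i}\big[\mathbb{E}_{X_{k+1} Y_{k+1} | X_i Y_i}[f(X_{k+1} Y_{k+1})]\big] \\
&\quad \le G^{\hat{s}}_{X_{[k+1]}}(\lambda_{[k]}) - \mathbb{E}_{X_{k+1}} \mathrm{Var}_{Y_{k+1} | X_{k+1}}[f(X_{k+1} Y_{k+1})] \\
&\qquad + \sum_{i=1}^{k} \lambda_i \mathbb{E}_{X_{k+1} X_i} \mathrm{Var}_{Y_i | X_i X_{k+1}}\big[\mathbb{E}_{Y_{k+1} | X_i Y_i X_{k+1}}[f(X_{k+1} Y_{k+1})]\big] && (54) \\
&\quad = G^{\hat{s}}_{X_{[k+1]}}(\lambda_{[k]}) - \mathbb{E}_{X_{k+1}} \mathrm{Var}_{Y_{k+1} | X_{k+1}}[f(X_{k+1} Y_{k+1})] \\
&\qquad + \sum_{i=1}^{k} \lambda_i \mathbb{E}_{X_{k+1}} \mathrm{Var}_{Y_i | X_{k+1}}\big[\mathbb{E}_{Y_{k+1} | Y_i X_{k+1}}[f(X_{k+1} Y_{k+1})]\big] && (55) \\
&\quad \le G^{\hat{s}}_{X_{[k+1]}}(\lambda_{[k]}) + G^{\hat{s}}_{Y_{[k+1]}}(\lambda_{[k]}).
\end{aligned}$$

Here (54) holds because of property 4 of Table 7 for the choice of $A = (X_{k+1}, Y_{k+1})$, $C = Y_i$, $D = X_i$, $E = X_{k+1}$. The Markov chain condition that we need to verify is $Y_i - X_i - X_{k+1}$, which holds because $X_{[k+1]}$ is independent of $Y_{[k+1]}$; (55) holds because $\mathbb{E}_{Y_{k+1} | X_i Y_i X_{k+1}}[f(X_{k+1} Y_{k+1})]$ is equal to $\mathbb{E}_{Y_{k+1} | Y_i X_{k+1}}[f(X_{k+1} Y_{k+1})]$, since $(X_{k+1}, X_i)$ is independent of $(Y_{k+1}, Y_i)$.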


C.2 Data processing 

We need to show that $G^{\hat{s}}_{Y_{[k+1]}}(\lambda_{[k]}) \le G^{\hat{s}}_{X_{[k+1]}}(\lambda_{[k]})$ for every $p(y_i | x_i)$. As before, we prove this in two stages:

Part I ($Y_i$ is a function of $X_i$): Let us start with an algebraic proof of data processing for $G^{s}_{Y_{[k+1]}}(\lambda_{[k]})$. Take some arbitrary $p(u | y_{k+1})$. Define

$$p(u, x_{[k+1]}, y_{[k+1]}) = p(x_{[k+1]}, y_{[k+1]})\, p(u | y_{k+1}).$$

Then we have $I(Y_{k+1}; U) = I(X_{k+1}; U)$ and $I(Y_i; U) \le I(X_i; U)$. Therefore,

$$\begin{aligned}
-I(Y_{k+1}; U) + \sum_{i=1}^{k} \lambda_i I(Y_i; U) &\le -I(X_{k+1}; U) + \sum_{i=1}^{k} \lambda_i I(X_i; U) && (56) \\
&\le G^{s}_{X_{[k+1]}}(\lambda_{[k]}). && (57)
\end{aligned}$$

Since this holds for any arbitrary $p(u | y_{k+1})$, we get the desired result.
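The two relations used in (56) can be checked on a small example. The sketch below assumes $k = 1$, a four-letter alphabet for $X_1, X_2$, the hypothetical deterministic map $y_i = x_i \bmod 2$, and an arbitrary toy channel $p(u | y_2)$; none of these choices come from the text.

```python
import math

def marginal(joint, idx):
    """Marginal pmf of the coordinates listed in idx."""
    out = {}
    for outcome, pr in joint.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + pr
    return out

def mutual_information(joint, idx_a, idx_b):
    """I(A;B) in bits, where A and B are groups of coordinates of the joint pmf."""
    pa, pb = marginal(joint, idx_a), marginal(joint, idx_b)
    pab = marginal(joint, idx_a + idx_b)
    total = 0.0
    for key, pr in pab.items():
        if pr > 0:
            a, b = key[:len(idx_a)], key[len(idx_a):]
            total += pr * math.log2(pr / (pa[a] * pb[b]))
    return total

def parity(x):
    """Hypothetical deterministic map y_i = g(x_i)."""
    return x % 2

# Hypothetical correlated source (X1, X2) on {0, 1, 2, 3}^2.
weights = {(x1, x2): 1.0 + 3.0 * (x1 == x2) + 0.5 * ((x1 + x2) % 3)
           for x1 in range(4) for x2 in range(4)}
norm = sum(weights.values())
chan_u = [[0.8, 0.2], [0.3, 0.7]]  # p(u | y2)
lam = 0.7

# Coordinates: 0 = x1, 1 = x2, 2 = y1, 3 = y2, 4 = u.
joint = {}
for (x1, x2), w in weights.items():
    y1, y2 = parity(x1), parity(x2)
    for u in (0, 1):
        joint[(x1, x2, y1, y2, u)] = w / norm * chan_u[y2][u]

i_y2u = mutual_information(joint, [3], [4])
i_x2u = mutual_information(joint, [1], [4])
i_y1u = mutual_information(joint, [2], [4])
i_x1u = mutual_information(joint, [0], [4])
lhs = -i_y2u + lam * i_y1u   # left hand side of (56)
rhs = -i_x2u + lam * i_x1u   # right hand side of (56)
print(abs(i_y2u - i_x2u) < 1e-9, lhs <= rhs + 1e-12)
```

The equality $I(Y_{k+1}; U) = I(X_{k+1}; U)$ relies on $U$ depending on $X_{k+1}$ only through $Y_{k+1}$, which is how the joint distribution is constructed above.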

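The proof for $G^{\hat{s}}_{X_{[k+1]}}(\lambda_{[k]})$ is similar. Take some function $f(Y_{k+1})$. Then, $f(Y_{k+1})$ can also be thought of as a function of $X_{k+1}$, since $Y_{k+1}$ itself is a function of $X_{k+1}$. Next, we have

$$\mathrm{Var}_{Y_i}\big[\mathbb{E}_{X_{k+1} | Y_i}[f(Y_{k+1})]\big] \le \mathrm{Var}_{X_i Y_i}\big[\mathbb{E}_{X_{k+1} | X_i Y_i}[f(Y_{k+1})]\big] = \mathrm{Var}_{X_i}\big[\mathbb{E}_{X_{k+1} | X_i}[f(Y_{k+1})]\big],$$

where the inequality follows from the law of total variance (property 3 of Table 7). Then, we have

$$\begin{aligned}
-\mathrm{Var}[f(Y_{k+1})] + \sum_{i=1}^{k} \lambda_i \mathrm{Var}_{Y_i}\big[\mathbb{E}_{Y_{k+1} | Y_i}[f(Y_{k+1})]\big]
&\le -\mathrm{Var}[f(Y_{k+1})] + \sum_{i=1}^{k} \lambda_i \mathrm{Var}_{X_i}\big[\mathbb{E}_{X_{k+1} | X_i}[f(Y_{k+1})]\big] \\
&\le G^{\hat{s}}_{X_{[k+1]}}(\lambda_{[k]}).
\end{aligned}$$

Since this holds for any arbitrary function $f(Y_{k+1})$, we get the desired result.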
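The variance comparison behind this argument can also be checked numerically. The sketch assumes $k = 1$, the same hypothetical map $y_i = x_i \bmod 2$ on a four-letter alphabet, and an arbitrary toy function $f$ of $Y_2$; the helper names are illustrative.

```python
def var_of_cond_exp(joint, g, idx):
    """Variance, over the coordinates in idx, of E[g | those coordinates]."""
    mass, weighted = {}, {}
    for o, pr in joint.items():
        key = tuple(o[i] for i in idx)
        mass[key] = mass.get(key, 0.0) + pr
        weighted[key] = weighted.get(key, 0.0) + pr * g(o)
    mean = sum(weighted.values())
    return sum(m * (weighted[k] / m - mean) ** 2 for k, m in mass.items() if m > 0)

def parity(x):
    """Hypothetical deterministic map y_i = g(x_i)."""
    return x % 2

# Hypothetical correlated source (X1, X2) on {0, 1, 2, 3}^2.
weights = {(x1, x2): 1.0 + 3.0 * (x1 == x2) + 0.5 * ((x1 + x2) % 3)
           for x1 in range(4) for x2 in range(4)}
norm = sum(weights.values())

# Coordinates: 0 = x1, 1 = x2, 2 = y1, 3 = y2.
joint = {(x1, x2, parity(x1), parity(x2)): w / norm
         for (x1, x2), w in weights.items()}

f_table = {0: -1.0, 1: 2.5}   # arbitrary function of Y2

def f(o):
    return f_table[o[3]]

var_given_y1 = var_of_cond_exp(joint, f, [2])        # Var_{Y1} E[f | Y1]
var_given_x1y1 = var_of_cond_exp(joint, f, [0, 2])   # Var_{X1 Y1} E[f | X1 Y1]
var_given_x1 = var_of_cond_exp(joint, f, [0])        # Var_{X1} E[f | X1]
print(var_given_y1 <= var_given_x1y1 + 1e-12,
      abs(var_given_x1y1 - var_given_x1) < 1e-12)
```

The equality holds because $Y_1$ is determined by $X_1$, so conditioning on $(X_1, Y_1)$ is the same as conditioning on $X_1$ alone.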


Part II ($Y_i = (X_i, A_i)$ where the $A_i$'s are mutually independent of each other, and of $X_{[k+1]}$): We would like to show that

$$G^{\hat{s}}_{X_{[k+1]}}(\lambda_{[k]}) = G^{\hat{s}}_{A_{[k+1]} X_{[k+1]}}(\lambda_{[k]}).$$

From the additivity of $G^{\hat{s}}$ for products of independent distributions we have $G^{\hat{s}}_{A_{[k+1]} X_{[k+1]}}(\lambda_{[k]}) = G^{\hat{s}}_{X_{[k+1]}}(\lambda_{[k]}) + G^{\hat{s}}_{A_{[k+1]}}(\lambda_{[k]})$. Therefore, we need to show that

$$G^{\hat{s}}_{A_{[k+1]}}(\lambda_{[k]}) = 0,$$

when the $A_i$'s are mutually independent.

As before, let us begin with the proof of $G^{s}_{A_{[k+1]}}(\lambda_{[k]}) = 0$. We need to show that for any arbitrary $p(u | a_{k+1})$ we have

$$-I(A_{k+1}; U) + \sum_{i=1}^{k} \lambda_i I(A_i; U) \le 0. \qquad (58)$$

This inequality holds because $I(A_i; U) = 0$ for $i \in [k]$.

Now, to show that $G^{\hat{s}}_{A_{[k+1]}}(\lambda_{[k]}) = 0$, we need to show that for any function $f(A_{k+1})$ we have

$$-\mathrm{Var}[f(A_{k+1})] + \sum_{i=1}^{k} \lambda_i \mathrm{Var}_{A_i}\big[\mathbb{E}_{A_{k+1} | A_i}[f(A_{k+1})]\big] \le 0.$$

From the independence of $A_i$ and $A_{k+1}$, the conditional expectation $\mathbb{E}_{A_{k+1} | A_i}[f(A_{k+1})]$ does not depend on $A_i$, so $\mathrm{Var}_{A_i}\big[\mathbb{E}_{A_{k+1} | A_i}[f(A_{k+1})]\big] = 0$. Hence, the above inequality holds.